The default re-notification interval is 30m, and it can be changed via the Notification object, but that affects all services.
I want to set the re-notification interval to 5m for a critical service, disable re-notification entirely for low-priority services, and leave the default 30m for the rest.
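For reference, changing it globally is just a matter of setting interval on the Notification apply rule; a minimal sketch (the command template and user are placeholders, assumed to exist):
apply Notification "mail-notification" to Service {
  import "mail-service-notification"  # notification command template, assumed to exist
  users = [ "icingaadmin" ]           # placeholder user
  interval = 30m                      # re-notification interval for ALL matched services
  assign where service.name != ""
}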
Found a similar discussion here, but no solution yet:
https://www.reddit.com/r/icinga/comments/73uc8s/setting_notification_interval_icingaweb2/
Found an indirect way to achieve this: define custom vars on the Service object and access them from the Notification object.
A sample config is given below:
apply Service "service1" {
# service conf goes here
vars.notification.interval = 5m
}
apply Service "service2" {
# service conf goes here
vars.notification.interval = 2h
}
apply Service "service3" {
# service conf goes here
vars.notification.interval = 0
}
apply Service "service4" {
# service conf goes here
}
apply Notification "notifications1" to Service {
# notification conf goes here
interval = (service.vars.notification.interval) || 20m
}
In the above example, the re-notification intervals are as follows:
service1: 5 minutes
service2: 2 hours
service3: Notify once, no re-notification
service4: 20 minutes (the system default is 30m; here we changed the fallback to 20 minutes)
Explanation:
interval = (service.vars.notification.interval) || 20m
The value of interval will be set to service.vars.notification.interval if it is defined; otherwise it falls back to 20m.
I know I can specify .WithMisfireHandlingInstructionDoNothing() when building the trigger, but some of my jobs are triggered via the IScheduler.TriggerJob() method, i.e. without any trigger of their own.
I can detect and log misfires in an ITriggerListener, but how can I stop Quartz from trying to fire the job again? If I understand correctly, .VetoJobExecution is not usable, since the job has to be triggered successfully anyway.
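For illustration, my listener is roughly this shape (the class name is mine; only the misfire callback does anything useful):
using System;
using System.Threading;
using System.Threading.Tasks;
using Quartz;

public class MisfireLoggingListener : ITriggerListener
{
    public string Name => "MisfireLoggingListener";

    // Detection works fine here, but this callback cannot cancel the re-fire
    public Task TriggerMisfired(ITrigger trigger, CancellationToken cancellationToken = default)
    {
        Console.WriteLine($"Misfire detected for trigger {trigger.Key}");
        return Task.CompletedTask;
    }

    public Task TriggerFired(ITrigger trigger, IJobExecutionContext context, CancellationToken cancellationToken = default)
        => Task.CompletedTask;

    // Vetoing only applies to triggers that fired normally, so it does not help here
    public Task<bool> VetoJobExecution(ITrigger trigger, IJobExecutionContext context, CancellationToken cancellationToken = default)
        => Task.FromResult(false);

    public Task TriggerComplete(ITrigger trigger, IJobExecutionContext context, SchedulerInstruction triggerInstructionCode, CancellationToken cancellationToken = default)
        => Task.CompletedTask;
}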
Any other ideas?
Edit: my implementation
// One-shot trigger: fires immediately; on a misfire, NextWithRemainingCount
// skips to the next scheduled fire, and with repeat count 0 there is none
JobDataMap jobData = new JobDataMap(data);
IJobDetail jobTemplate = await jobScheduler.GetJobDetail(jobKey);
var jobTrigger = TriggerBuilder.Create()
    .ForJob(jobTemplate)
    .UsingJobData(jobData)
    .WithSimpleSchedule(s => s
        .WithRepeatCount(0)
        .WithMisfireHandlingInstructionNextWithRemainingCount())
    .StartNow()
    .Build();
await jobScheduler.ScheduleJob(jobTrigger);
Well, if you just want the TriggerJob behavior, you can achieve that by adding one simple trigger to the scheduler that fires immediately, and configuring the misfire policy on that trigger. So you can change the call sites of TriggerJob to create a simple trigger instead (maybe via an extension method that allows you to define the policy); the Quartz source for TriggerJob is here.
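Something like this hypothetical extension method could replace those call sites (the name and the chosen misfire policy are mine, not part of Quartz):
using System.Threading.Tasks;
using Quartz;

public static class SchedulerExtensions
{
    // Hypothetical helper: behaves like IScheduler.TriggerJob(), but the
    // one-shot trigger carries an explicit misfire policy.
    public static Task TriggerJobWithPolicy(this IScheduler scheduler, JobKey jobKey, JobDataMap data)
    {
        var trigger = TriggerBuilder.Create()
            .ForJob(jobKey)
            .UsingJobData(data)
            .WithSimpleSchedule(s => s
                .WithRepeatCount(0)
                .WithMisfireHandlingInstructionNextWithRemainingCount())
            .StartNow()
            .Build();
        return scheduler.ScheduleJob(trigger);
    }
}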
I think I've got it.
You have to use the WithMisfireHandlingInstructionNextWithRemainingCount policy, and WithRepeatCount(0).
But the trick here is to set the right value for the scheduler's misfire threshold so that the missed fire is considered a real misfire. I mean, if your misfire threshold is set to 60 seconds (the default), then any job that fires within 60 seconds of its scheduled time will NOT be considered a misfire, and so the misfire policy will not be applied.
In my case, with 3-second jobs, I had to set the misfire threshold (WithMisfireThreshold) to 1 second or less.
This way, the job is not retried when it has "really misfired".
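For reference, the threshold is a scheduler-wide setting; assuming the standard property-based configuration, it can be lowered like this (the 1000 ms value is just my case):
using System.Collections.Specialized;
using Quartz;
using Quartz.Impl;

// Fire times missed by more than this many milliseconds count as misfires
var props = new NameValueCollection
{
    ["quartz.jobStore.misfireThreshold"] = "1000"
};
IScheduler scheduler = await new StdSchedulerFactory(props).GetScheduler();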
I am building, with Terraform, a Cloud Scheduler job that will hit a service deployed in Cloud Run. Because the service and the scheduler are deployed in different pipelines, it is possible that the service does not exist yet when the scheduler pipeline runs. This is why I am using data "google_cloud_run_service" to retrieve the service and then control whether it exists in the scheduler block. It is here that I can't find the right syntax to use in count.
data "google_cloud_run_service" "run-service" {
name = "serv-${var.project_env}"
location = var.region
}
resource "google_cloud_scheduler_job" "job" {
count = length(data.google_cloud_run_service.run-service.status)
name = "snap-job-${var.project_env}"
description = "Call write API on my service"
schedule = "* * * * *"
time_zone = "Etc/UTC"
http_target {
http_method = "GET"
uri = "${data.google_cloud_run_service.run-service.status[0].url}/write"
oidc_token {
service_account_email = google_service_account.sa_scheduler.email
}
}
depends_on = [google_app_engine_application.app]
}
The above length(data.google_cloud_run_service.run-service.status) control has no effect: Terraform tries to create the scheduler even though there is no service defined.
I have also tried other variations with similar results, such as length(data.google_cloud_run_service.run-service.status[0]) > 0 ? 1 : 0.
Other options that I tried do not work either, failing with different errors:
data.google_cloud_run_service.run-service ? 1 : 0 fails with: data.google_cloud_run_service.run-service is object with 9 attributes; The condition expression must be of type bool
data.google_cloud_run_service.run-service.status[0].url ? 1 : 0 fails with: data.google_cloud_run_service.run-service is object with 9 attributes; The condition expression must be of type bool
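For clarity, what I am after is a count derived from whether the data source found the service, i.e. something of this shape (this is precisely what I cannot get to work):
resource "google_cloud_scheduler_job" "job" {
  # count expects a number, so a boolean has to go through a conditional
  count = length(data.google_cloud_run_service.run-service.status) > 0 ? 1 : 0
  # ... rest of the job as above
}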
In Jenkins, I want to be able to delay the execution of a job, e.g. by 1 hour, after I click Build.
I want to set up the parameters, click Build, and have the job stay in the queue for 1 hour, without using an executor, and then start. I do not want to schedule the job periodically or anything like that, just force it to stay in the queue for a certain (maybe configurable) period of time.
Is there a way to do this?
Using sleep here is exactly the right thing to do. To work around the fact that the sleeping build would otherwise consume an executor and leave it idle, just execute the sleep command on a special sleep node. Every job then uses this sleep node for sleeping, which means no regular build executors get consumed.
// Wait on the dedicated sleep node (label 'sleeper') so no build executor is blocked
node('sleeper') {
    stage('Countdown 1hr') {
        sleep 3600 // seconds
    }
}
// Then do the actual work on a regular build node
node('buildnode1') {
    stage('build') {
        echo "Whatever should be done here"
    }
}
I want to set Delayed::Worker.max_run_time = 1.hour for a specific job that I know will take a while. However, this is set as a global configuration in initializers/delayed_job_config.rb. As a result, this change will make ALL of my jobs have a max run time of 1 hour. Is there a way to just change it for one specific job without creating a custom job?
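For reference, that initializer is just the one global assignment (a minimal sketch):
# config/initializers/delayed_job_config.rb
Delayed::Worker.max_run_time = 1.hour # applies to every job a worker picks up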
Looking at the Worker class on GitHub:
def run(job)
  job_say job, 'RUNNING'
  runtime = Benchmark.realtime do
    Timeout.timeout(self.class.max_run_time.to_i, WorkerTimeout) { job.invoke_job }
    job.destroy
  end
  job_say job, 'COMPLETED after %.4f' % runtime
  return true # did work
rescue DeserializationError => error
  job.last_error = "#{error.message}\n#{error.backtrace.join("\n")}"
  failed(job)
rescue Exception => error
  self.class.lifecycle.run_callbacks(:error, self, job){ handle_failed_job(job, error) }
  return false # work failed
end
It doesn't appear that you can set a per-job max. But I would think you could roll your own timeout in your job, assuming the Timeout class allows nesting. Worth a try:
class MyLongJobClass
  def perform
    Timeout.timeout(1.hour.to_i, WorkerTimeout) { do_perform }
  end

  private

  def do_perform
    # ... real perform work
  end
end
You can now set a per-job max run time, but it must be lower than the global constant.
To set a per-job max run time that overrides Delayed::Worker.max_run_time, you can define a max_run_time method on the job.
NOTE: this can ONLY be used to set a max_run_time that is lower than Delayed::Worker.max_run_time. Otherwise the lock on the job would expire and another worker would start working on the in-progress job.
I have a parent Job class where I set max_run_time to 10 minutes, then override that method for the one job that I want to run for a really long time, and set the global constant to be really long as well.
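A sketch of that setup (class names and durations are illustrative):
# Global ceiling: keep it above every per-job value, since it also
# controls when a job's lock is considered expired.
Delayed::Worker.max_run_time = 6.hours

class BaseJob
  def max_run_time
    10.minutes # default for most jobs
  end
end

class VeryLongJob < BaseJob
  def max_run_time
    4.hours # still below Delayed::Worker.max_run_time
  end

  def perform
    # ... the actual long-running work
  end
end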
We have a Grails project that runs behind a load balancer. There are three instances of the Grails application running on the server (using separate Tomcat instances). Each instance has its own searchable index. Because the indexes are separate, the automatic update is not enough to keep the index consistent between the application instances. Because of this we have disabled searchable index mirroring, and updates to the index are done manually in a scheduled quartz job. According to our understanding, no other part of the application should modify the index.
The quartz job runs once a minute and checks the database for rows that have been updated by the application, then re-indexes those objects. The job also checks whether the same job is already running, so it doesn't do any concurrent indexing. The application runs fine for a few hours after startup, and then suddenly, when the job is starting, LockObtainFailedException is thrown:
22.10.2012 11:20:40 [xxxx.ReindexJob] ERROR Could not update searchable index, class org.compass.core.engine.SearchEngineException: Failed to open writer for sub index [product]; nested exception is org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock#/home/xxx/tomcat/searchable-index/index/product/lucene-a7bbc72a49512284f5ac54f5d7d32849-write.lock
According to the log, the last time the job was executed, re-indexing was done without any errors and the job finished successfully. Still, this time the re-index operation throws the locking exception, as if the previous operation were unfinished and the lock had not been released. The lock will not be released until the application is restarted.
We tried to solve the problem by manually opening the locked index, which causes the following error to be printed to the log:
22.10.2012 11:21:30 [manager.IndexWritersManager ] ERROR Illegal state, marking an index writer as open, while another is marked as open for sub index [product]
After this the job seems to work correctly and doesn't get stuck in a locked state again. However, this causes the application to constantly use 100% of the CPU. Below is a shortened version of the quartz job code.
Any help would be appreciated to solve the problem, thanks in advance.
class ReindexJob {

    def compass
    ...
    static Calendar lastIndexed

    static triggers = {
        // Every day every minute (at xx:xx:30), start delay 2 min
        // cronExpression: "s m h D M W [Y]"
        cron name: "ReindexTrigger", cronExpression: "30 * * * * ?", startDelay: 120000
    }

    def execute() {
        if (ConcurrencyHelper.isLocked(ConcurrencyHelper.Locks.LUCENE_INDEX)) {
            log.error("Search index has been locked, not doing anything.")
            return
        }
        try {
            boolean acquiredLock = ConcurrencyHelper.lock(ConcurrencyHelper.Locks.LUCENE_INDEX, "ReindexJob")
            if (!acquiredLock) {
                log.warn("Could not lock search index, not doing anything.")
                return
            }
            Calendar reindexDate = lastIndexed
            Calendar newReindexDate = Calendar.instance
            if (!reindexDate) {
                reindexDate = Calendar.instance
                reindexDate.add(Calendar.MINUTE, -3)
                lastIndexed = reindexDate
            }
            log.debug("+++ Starting ReindexJob, last indexed ${TextHelper.formatDate("yyyy-MM-dd HH:mm:ss", reindexDate.time)} +++")
            Long start = System.currentTimeMillis()
            String reindexMessage = ""

            // Retrieve the ids of products that have been modified since the job last ran
            String productQuery = "select p.id from Product ..."
            List<Long> productIds = Product.executeQuery(productQuery, ["lastIndexedDate": reindexDate.time, "lastIndexedCalendar": reindexDate])

            if (productIds) {
                reindexMessage += "Found ${productIds.size()} product(s) to reindex. "
                final int BATCH_SIZE = 10
                Long time = TimeHelper.timer {
                    for (int inserted = 0; inserted < productIds.size(); inserted += BATCH_SIZE) {
                        log.debug("Indexing from ${inserted + 1} to ${Math.min(inserted + BATCH_SIZE, productIds.size())}: ${productIds.subList(inserted, Math.min(inserted + BATCH_SIZE, productIds.size()))}")
                        Product.reindex(productIds.subList(inserted, Math.min(inserted + BATCH_SIZE, productIds.size())))
                        Thread.sleep(250)
                    }
                }
                reindexMessage += " (${time / 1000} s). "
            } else {
                reindexMessage += "No products to reindex. "
            }
            log.debug(reindexMessage)

            // Re-index brands
            Brand.reindex()

            lastIndexed = newReindexDate
            log.debug("+++ Finished ReindexJob (${(System.currentTimeMillis() - start) / 1000} s) +++")
        } catch (Exception e) {
            log.error("Could not update searchable index, ${e.class}: ${e.message}")
            if (e instanceof org.apache.lucene.store.LockObtainFailedException || e instanceof org.compass.core.engine.SearchEngineException) {
                log.info("This is a Lucene index locking exception.")
                for (String subIndex in compass.searchEngineIndexManager.getSubIndexes()) {
                    if (compass.searchEngineIndexManager.isLocked(subIndex)) {
                        log.info("Releasing Lucene index lock for sub index ${subIndex}")
                        compass.searchEngineIndexManager.releaseLock(subIndex)
                    }
                }
            }
        } finally {
            ConcurrencyHelper.unlock(ConcurrencyHelper.Locks.LUCENE_INDEX, "ReindexJob")
        }
    }
}
Based on JMX CPU samples, it seems that Compass is doing some scheduling behind the scenes. Comparing 1-minute CPU samples of a normal instance and a 100% CPU instance, a few things stand out:
org.apache.lucene.index.IndexWriter.doWait() is using most of the CPU time.
Compass Scheduled Executor Thread is shown in the thread list, this was not seen in a normal situation.
One Compass Executor Thread is doing commitMerge, in a normal situation none of these threads was doing commitMerge.
You can try increasing the 'compass.transaction.lockTimeout' setting. The default is 10 (seconds).
Another option is to disable concurrency in Compass and make it synchronous. This is controlled with the 'compass.transaction.processor.read_committed.concurrentOperations': 'false' setting. You might also have to set 'compass.transaction.processor' to 'read_committed'.
These are the compass settings we are currently using:
compassSettings = [
    'compass.engine.optimizer.schedule.period': '300',
    'compass.engine.mergeFactor': '1000',
    'compass.engine.maxBufferedDocs': '1000',
    'compass.engine.ramBufferSize': '128',
    'compass.engine.useCompoundFile': 'false',
    'compass.transaction.processor': 'read_committed',
    'compass.transaction.processor.read_committed.concurrentOperations': 'false',
    'compass.transaction.lockTimeout': '30',
    'compass.transaction.lockPollInterval': '500',
    'compass.transaction.readCommitted.translog.connection': 'ram://'
]
This has concurrency switched off. You can make it asynchronous again by changing the 'compass.transaction.processor.read_committed.concurrentOperations' setting to 'true' (or removing the entry).
Compass configuration reference:
http://static.compassframework.org/docs/latest/core-configuration.html
Documentation for the concurrency of read_committed processor:
http://www.compass-project.org/docs/latest/reference/html/core-searchengine.html#core-searchengine-transaction-read_committed
If you want to keep async operations, you can also control the number of threads they use. Setting compass.transaction.processor.read_committed.concurrencyLevel=1 would allow asynchronous operations but use just one thread (the default is 5 threads). There are also the compass.transaction.processor.read_committed.backlog and compass.transaction.processor.read_committed.addTimeout settings; an example follows below.
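For example, added to the settings map shown above (values illustrative):
compassSettings = [
    // keep asynchronous processing, but limit it to a single worker thread
    'compass.transaction.processor.read_committed.concurrentOperations': 'true',
    'compass.transaction.processor.read_committed.concurrencyLevel': '1'
]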
I hope this helps.