Report a metric to Prometheus every hour, or every day - monitoring

I'm using Prometheus to report metrics about my system. What is the best way to report a counter that is the output of an hourly or daily job?
For example, I have an hourly job with a numeric output, and I would like to monitor that number and raise an alert if it falls below a specific threshold.
Thanks,
Ori

I think what you are looking for is the node_exporter; if you read its documentation you will see a textfile collector option.
If you run the job from cron, I suggest you write the resulting value to a file and use this collector to pick up the data.
You will find a bit more detail about how to do it here: https://www.robustperception.io/using-the-textfile-collector-from-a-shell-script
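As a minimal sketch of that approach, assuming your job is written in (or can shell out to) Python and uses the official prometheus_client library: the metric name, the compute_result() helper, and the collector directory below are placeholders for your setup.

    from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

    def compute_result():
        # Hypothetical placeholder for whatever your hourly job calculates.
        return 42.0

    registry = CollectorRegistry()
    result = Gauge(
        "hourly_job_result",
        "Numeric output of the hourly batch job",
        registry=registry,
    )
    result.set(compute_result())

    # node_exporter must be started with --collector.textfile.directory
    # pointing at this directory (the path is an assumption about your setup).
    write_to_textfile(
        "/var/lib/node_exporter/textfile_collector/hourly_job.prom", registry
    )

An alert rule along the lines of hourly_job_result < your threshold then covers the "below a specific threshold" case.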

You can use the Pushgateway and push the metrics to Prometheus at the end of your hourly/daily job (if it is not running as a service). If it is running as a service, make sure you take the scrape interval configuration into account.
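For the Pushgateway route, a rough sketch with the Python client library; the gateway address, job name, and metric value are placeholders:

    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

    registry = CollectorRegistry()
    result = Gauge(
        "daily_job_result",
        "Numeric output of the daily batch job",
        registry=registry,
    )
    result.set(42.0)  # replace with the value your job actually computed

    # "pushgateway.example.com:9091" is a placeholder for your Pushgateway address.
    push_to_gateway("pushgateway.example.com:9091", job="daily_job", registry=registry)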

Related

Lead time for changes

I am working on a project where I need to generate lead time for changes per application, per day.
Is there a Prometheus metric that provides lead time for changes, and how would I integrate it into a Grafana dashboard?
There is not going to be a metric or dashboard out of the box for this, the way I would approach this problem is:
You will need to instrument your deployment code with the Prometheus client library of your choice. The deployment code will need to grab the commit time; assuming you are using git, you can use git log filtered to the folder your application is in.
Now that you have the commit date, you can do a date diff between that and the current time (after the app has been deployed to PRD) to get the lead time of X seconds.
To get it into prometheus, use the node_exporter (or windows_exporter) and their textfile collectors to read textfiles that your deployment code writes and surface them for prometheus to scrape. Most of the client libraries have logic to help you write these files, and even if there is not, the format of the textfiles is pretty easy to use by writing the files directly.
You will want to surface this as a gauge metric, and have a label to indicate which application was deployed. The end result will be a single metric that you can query from grafana or set up alerts that will work for any application/folder that you deploy. To mimic the dashboard that you linked to, I am pretty sure you will want to use the over_time functions.
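To make that concrete, here is a rough sketch of what the deployment code could write, assuming Python, the prometheus_client library, and a node_exporter textfile directory; the application name, folder, and file paths are placeholders:

    import subprocess
    import time

    from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

    APP_NAME = "my-app"                    # placeholder application name
    APP_FOLDER = "services/my-app"         # placeholder path inside the git repo
    TEXTFILE = "/var/lib/node_exporter/textfile_collector/lead_time.prom"  # assumed dir

    # Unix timestamp of the most recent commit that touched the application folder.
    commit_ts = int(
        subprocess.check_output(
            ["git", "log", "-1", "--format=%ct", "--", APP_FOLDER]
        ).strip()
    )

    # Lead time = time the change reached PRD minus time it was committed.
    lead_time_seconds = time.time() - commit_ts

    registry = CollectorRegistry()
    lead_time = Gauge(
        "deployment_lead_time_seconds",
        "Lead time for changes, from last commit to production deployment",
        ["application"],
        registry=registry,
    )
    lead_time.labels(application=APP_NAME).set(lead_time_seconds)

    write_to_textfile(TEXTFILE, registry)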
I also want to note that it might be easier for you to store the deployment/lead time in a sql database/something other than prometheus and use that as a data source into grafana. For applications that do not deploy frequently you would easily run into missing series when querying by using prometheus as a datastore, and the overhead of setting up the node_exporters and the logic to manage the textfiles might outweigh the benefits if you can just INSERT into a sql table.

Telegraf, trigger at start and not interval?

Hello, I am currently trying to parse a folder of many CSV files (ca. 3 GB) into InfluxDB.
On the InfluxData blog it was suggested that this would be the fastest way, since Telegraf is written in Go. So:
I can get everything to work, and I can parse all CSVs and write them to InfluxDB.
The problem is that parsing and writing the files takes a long time (old MacBook, more than an hour I think), and when the agent interval is shorter than the time it takes to write the data, the Telegraf agent starts reading and writing all the files again at the next interval. So it never finishes, and my RAM fills up with the same parsed data over and over. If I set the interval really high, I have to wait one whole interval before the agent even starts, so that is not an option either.
The question is:
Can Telegraf be triggered like a script, so that I just run it once and do not have to wait for an interval to start?
The functionality you need has been added since this question was asked. You can now run Telegraf with a --once flag.
I can't find it documented anywhere, but the commit is here.
It's available in v1.15.0-rc1

Any way of monitoring Airflow DAG's execution time?

I'd like to use Airflow with StatsD and DataDog to monitor whether a DAG takes, for example, twice as long as its previous execution(s). So I need some kind of real-time timer for a DAG (or operator).
I'm aware that Airflow supports some metrics.
However, to my understanding all of those metrics relate to finished tasks/DAGs, right? So that is not a solution, because I'd like to monitor running DAGs.
I also considered the execution_timeout/SLA features, but they are not suitable for this use case:
I'd like to be notified when a DAG hangs, but I don't want to kill it.
There are a number of different ways you could handle this:
In the past I've configured a telemetry DAG which would collect the current state of all tasks/DAGs by querying the metadata tables. I'd collect these metrics and push them up to CloudWatch. This became problematic as these internal fields change often so we would run into issues when trying to upgrade to newer versions of Airflow.
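A rough sketch of that telemetry-task idea, assuming Airflow 2.x and the statsd Python package (as warned above, these internals shift between Airflow versions, so treat the names as assumptions):

    from datetime import datetime, timezone

    import statsd  # pip install statsd; assumes a StatsD/DogStatsD agent is reachable
    from airflow.models import DagRun
    from airflow.utils.state import State

    def report_running_dag_durations():
        """Emit the current elapsed time of every running DAG run as a gauge."""
        client = statsd.StatsClient("localhost", 8125)
        now = datetime.now(timezone.utc)
        for dag_run in DagRun.find(state=State.RUNNING):
            elapsed = (now - dag_run.start_date).total_seconds()
            client.gauge(f"airflow.dag_run_elapsed_seconds.{dag_run.dag_id}", elapsed)

Running this from a small DAG on a short schedule lets you alert when the gauge exceeds a DAG's typical duration without killing the run.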
There are also some well-maintained Prometheus exporters that some companies have open sourced. By setting these up you could poll the exposed export path as frequently as you wanted to (DataDog supports Prometheus).
These are just some of your options. Since the Airflow webserver is just a Flask app you can really expose metrics in whatever way you see fit.
As I understand it, you can monitor running tasks in DAGs using DataDog; refer to the DataDog integration with Airflow docs.
You can look up the metrics in the DogStatsD docs; this page is also useful for understanding what to monitor.
E.g., metrics such as:
airflow.operator_failures: number of operator failures.
airflow.operator_successes: number of operator successes.
airflow.dag_processing.processes: number of currently running DAG parsing processes.
airflow.scheduler.tasks.running: number of tasks running in executors (shown as tasks).

Jenkins results: Need a "single" summary report of last run of jobs

We have multiple Jenkins jobs scheduled at roughly the same time every night.
I would like a summary report of their status to be available to me, or sent to me.
I do not want to manually walk through the test suite every day.
Any advice on the topic would be much appreciated.
The Global Build Stats plugin might fit your needs. It does not support scheduled email, but if you need that you could use the rest API it exposes to write your own.
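If you do end up writing your own, here is a minimal sketch that uses Jenkins' standard JSON API rather than the plugin's API; the URL, credentials, and job names are placeholders for your environment:

    import requests

    JENKINS_URL = "https://jenkins.example.com"       # placeholder
    AUTH = ("user", "api-token")                      # placeholder credentials
    JOBS = ["nightly-build", "nightly-tests", "nightly-deploy"]  # your job names

    for job in JOBS:
        response = requests.get(
            f"{JENKINS_URL}/job/{job}/lastBuild/api/json", auth=AUTH, timeout=10
        )
        response.raise_for_status()
        build = response.json()
        # 'result' is None while a build is still running.
        print(
            f"{job}: #{build['number']} {build['result']} "
            f"({build['duration'] / 1000:.0f}s)"
        )

Piping that output into a nightly email gives you a single summary without clicking through each job.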

Reserved CPU Time in Google Dataflow

I have a question regarding the reserved CPU time field in Google Dataflow. I don't understand why it varies so widely depending on the configuration of my run. I suspect that I am misinterpreting what the reserved CPU time really represents. To my understanding, it is the CPU time that was needed to complete the job I submitted, but based on the following evidence, it seems I may be mistaken. Is it the time that is allocated to your job, regardless of whether it actually uses the resources? If so, how do I get the actual CPU time of my job?
First I ran my job with a variable sized pool of workers (max 24 workers).
The corresponding stats are as follows:
Then, I ran my script using a fixed number of workers (10):
And the stats changed to:
They went from 15 days to 7 hours? How is that possible?!
Thanks!
If you hover over the "?" next to "Reserved CPU time", a pop-up message appears that reads: "The total time Dataflow was active on GCE instances, on a per-CPU basis." This indicates it is not the CPU time used by the VMs. At this time Dataflow does not aggregate per-machine CPU usage stats; you may, however, be able to use the Cloud Monitoring API to extract those metrics yourself.
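As a sketch of that Cloud Monitoring route, assuming the google-cloud-monitoring Python client; the project ID is a placeholder, and you would still need to narrow the filter to the worker VMs of your specific Dataflow job:

    import time

    from google.cloud import monitoring_v3

    PROJECT = "projects/my-gcp-project"  # placeholder project ID

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {
            "end_time": {"seconds": now},
            "start_time": {"seconds": now - 3600},  # last hour of the job
        }
    )

    # Per-VM CPU utilization reported by Compute Engine.
    results = client.list_time_series(
        request={
            "name": PROJECT,
            "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )

    for series in results:
        instance = series.resource.labels.get("instance_id", "unknown")
        points = [p.value.double_value for p in series.points]
        if points:
            print(f"{instance}: mean CPU utilization {sum(points) / len(points):.1%}")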
