I am using Prometheus and Grafana to visualize data from a Jenkins instance. I use the Jenkins Metrics and Prometheus Metrics plugins to expose metrics to Prometheus, and I have created a basic Grafana dashboard with instant metrics and some graphs. Now I need to write a PromQL query that extracts the fluctuation of a Jenkins job's build time since the last time the metric changed. I found the changes() and rate() PromQL functions, but I don't get the result I am expecting.
The last query that i used was: changes(default_jenkins_builds_last_build_duration_milliseconds{jenkins_job="$project"}[1m])
where the variable $project lets me select the job that I need to investigate in Grafana.
Is that the right approach?
Do you have any alternative ideas?
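For illustration only: changes() counts how many times the series changed within the window, not by how much. If the goal is the size of the fluctuation, a gauge function such as delta() may be closer to what is described here; a rough sketch (the 1h window is an assumption and should be longer than the typical gap between builds):

delta(default_jenkins_builds_last_build_duration_milliseconds{jenkins_job="$project"}[1h])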
I set up a monitoring system to track our Jenkins pipelines using Prometheus and Grafana (see Jenkins Prometheus Plugin). I am building some dashboards and while doing so I tried to create a table graph that displays the 5 slowest pipelines. This is the Prometheus query I used:
topk (5, max by (jenkins_job) (default_jenkins_builds_last_build_duration_milliseconds / 60000))
[Screenshot: Grafana table visualization]
However, instead of displaying 5 rows, the table shows numerous timestamps for every pipeline. Does anybody have an idea how to solve this? I tried several approaches described on Stack Overflow (e.g. this), without success. Thanks in advance!
[Desired result: the 5 slowest Jenkins pipelines, one record each]
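For illustration only: a likely cause is that the table panel is running a range query, so Grafana receives every sample of every series in the dashboard's time range instead of one value per pipeline. Assuming the query is switched to an instant query in the panel's query options (together with the table format), Prometheus returns only the latest sample per series, and the same topk expression yields at most one row per jenkins_job:

topk(5, max by (jenkins_job) (default_jenkins_builds_last_build_duration_milliseconds / 60000))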
I've got an ECS service reporting metrics to CloudWatch, collected with Codahale Metrics. Some of the metrics are counts, e.g. the count of requests made to an external service. Each service instance maintains and reports its own count to CloudWatch. To my understanding, this means the count values in CloudWatch are the individual counts per instance, with no way to see, e.g., the total. If each of three instances made 300 requests, the reported value would be 300, with no way to sum it up to 900.
What is the best way to fix this? Is adding an additional dimension, e.g. the ECS task ID, to the reported CloudWatch metric the way to go?
I'm graphing the results in Grafana, but that is probably not the important part.
Metrics are already aggregated in CloudWatch, assuming they have the same namespace and name. If these service request metrics are the same thing, they should be a single metric, and you can then add Dimensions to it, such as TaskId, RequestedService, or whatever you want to aggregate by.
Typically you have the opposite challenge in CloudWatch Metrics to the one you are describing: metrics are already aggregated together, and you want to drill down to particular values to debug some issue. For example, if you had a problem with a particular container task you would set the dimension TaskId=todo1, or if you suspected a service was down you would set RequestedService=todo2.
I suspect you are creating a separate metric for each service you make requests to; instead you want only one metric, with dimensions added to it as described earlier.
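To make the dimension idea concrete (the setup in the question uses Codahale from a Java service, so this is only a language-neutral sketch using boto3; the namespace, metric name, and dimension values are made up):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # One metric name shared by every task; TaskId and RequestedService are dimensions,
    # so you can keep a single metric and still drill down to a particular task or service.
    cloudwatch.put_metric_data(
        Namespace="MyService",                 # hypothetical namespace
        MetricData=[{
            "MetricName": "ExternalRequests",  # hypothetical metric name
            "Dimensions": [
                {"Name": "TaskId", "Value": "task-123"},
                {"Name": "RequestedService", "Value": "billing-api"},
            ],
            "Value": 300,
            "Unit": "Count",
        }],
    )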
Also, for this particular use case you might want to consider OpenTelemetry/X-Ray, which will create a service graph for you and handles the specific case of tracing requests through different services. That does take a bit of effort to set up, though.
I'm using Prometheus to report metrics about my system. I wanted to ask: what is the best way to report a counter that is the output of an hourly/daily job?
For example, I have an hourly job with a numeric output, and I would like to monitor that number and raise an alert if it is under a specific threshold.
Thanks,
Ori
I think what you are looking for is in node_exporter: if you read the docs, you will see it has a textfile collector option.
If you use a cron job, I suggest you store the job's result in a file and use this collector to pick up the data.
You will find a bit more detail about how to do it here: https://www.robustperception.io/using-the-textfile-collector-from-a-shell-script
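A rough sketch of that approach, assuming the prometheus_client Python library and a node_exporter started with --collector.textfile.directory pointing at the directory used below (the path and metric name are assumptions):

    from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

    registry = CollectorRegistry()
    result = Gauge("hourly_job_result", "Output of the hourly job", registry=registry)
    result.set(42)  # the value produced by this run of the job

    # node_exporter's textfile collector picks up *.prom files from its configured directory.
    write_to_textfile("/var/lib/node_exporter/textfile_collector/hourly_job.prom", registry)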
You can use the Pushgateway and push the metrics to Prometheus at the end of your hourly/daily job (if it is not running as a service). If it is running as a service, I hope you are aware of the scrape interval configuration.
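A similarly hedged sketch of the Pushgateway route with the prometheus_client Python library (the gateway address and metric name are assumptions); Prometheus then scrapes the Pushgateway on its normal scrape interval:

    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

    registry = CollectorRegistry()
    result = Gauge("hourly_job_result", "Output of the hourly job", registry=registry)
    result.set(42)  # value computed by the job

    # Run this at the very end of the hourly/daily job; the Pushgateway keeps the last pushed value.
    push_to_gateway("localhost:9091", job="hourly_job", registry=registry)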
I am using Google Dataflow at work.
I need to set the number of workers dynamically while a Dataflow batch job is running.
That is mainly because of Cloud Bigtable QPS.
We are using 3 Bigtable cluster nodes, and they cannot handle receiving all the traffic from 500 workers at once.
So I have to change the number of workers (from 500 to 25) just before inserting all the processed data into Bigtable.
Is there any way to achieve this goal?
Dataflow does not provide the ability to manually change the resource allocation of a batch job while it is running, however:
1) We plan to incorporate throttling into our autoscaling algorithms, so Dataflow would detect that it needs to downsize while writing to your Bigtable. I don't have a concrete ETA, but this is definitely on our roadmap.
2) Meanwhile, you can try to artificially limit the parallelism of your pipeline with a trick like this:
Take your PCollection<Something> (Something being the data type you're writing to Bigtable)
Pipe it through a sequence of transforms: ParDo(pair with a random key in 0..25), GroupByKey, ParDo(ungroup and remove random key). You get, again, a PCollection<Something>
Write this collection to Bigtable.
The trick here is that there is no parallelization within a single key after a GroupByKey, so the result of GroupByKey is a collection of 25 key-value pairs (where the value is an Iterable<Something>) that can't be processed by more than 25 workers in parallel. The ParDos following it will likely get fused together with the writing to Bigtable, and will thus have a parallelism of 25.
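The answer above is written against the Java Dataflow SDK; purely as an illustration of the same pair-with-a-random-key / GroupByKey / ungroup idea, a rough sketch using the Apache Beam Python SDK could look like this (the bucket count of 25 comes from the description above, everything else is assumed):

    import random

    import apache_beam as beam

    def limit_parallelism(pcoll, buckets=25):
        """Funnel elements through `buckets` keys to cap effective parallelism."""
        return (
            pcoll
            | "PairWithRandomKey" >> beam.Map(lambda x: (random.randrange(buckets), x))
            | "GroupIntoBuckets" >> beam.GroupByKey()
            | "Ungroup" >> beam.FlatMap(lambda kv: kv[1])  # drop the random key
        )

    # Hypothetical usage: write limit_parallelism(rows, buckets=25) to Bigtable
    # instead of writing the original collection directly.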
The caveat is that Dataflow is within its rights to materialize any intermediate collections if it predicts that this will improve the performance of the pipeline. It may even do this just for the sake of increasing the degree of parallelism (which goes explicitly against your goal in this example). But if you have an urgent job to run, I believe right now this will probably do what you want.
Meanwhile, the only long-term solution I can suggest, until we have throttling, is to use a smaller limit on the number of workers, or use a larger Bigtable cluster, or both.
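For the "smaller limit on number of workers" option, a hedged sketch of capping workers at job submission time with Apache Beam Python pipeline options (the project and region values are placeholders; the Java Dataflow SDK equivalent is the --maxNumWorkers flag):

    from apache_beam.options.pipeline_options import PipelineOptions

    # max_num_workers caps autoscaling for the whole job, which is a blunter tool
    # than throttling only the Bigtable write step.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",      # hypothetical project id
        region="us-central1",      # hypothetical region
        max_num_workers=25,
    )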
There's a lot of relevant information in the "DATA & ANALYTICS: Analyzing 25 billion stock market events in an hour with NoOps on GCP" talk from GCP/Next.
FWIW, you can increase the number of nodes of Bigtable before your batch job, give Bigtable a few minutes to adjust, and then start your job. You can turn down the Bigtable cluster when you're done with the batch job.
I would like to create some statistics for my Selenium UI test job in Jenkins. Calculating the metrics in the Maven job is easy, but is there any way to add a graph to the Jenkins job with the numbers I generate?
For example, I calculate the average site response time in my UI tests and add it, together with other metrics, as an output artifact (e.g. a JSON document). How can I show a graph that displays those metrics over the previous x runs directly inside Jenkins?
I'm not entirely sure this is the correct Stack Exchange site, so point me in the right direction if it isn't.
If it's not a default JUnit reporter (or some other popular reporter), then you can create your own HTML page and publish it using https://wiki.jenkins-ci.org/display/JENKINS/HTML+Publisher+Plugin