Setting frequency for Camus jobs - camus

I have just started with Camus. I am planning to run a Camus job every hour. We get ~80,000,000 messages (with an average size of ~4 KB) every hour.
How do I set the following properties:
# max historical time that will be pulled from each partition based on event timestamp
kafka.max.pull.hrs=1
# events with a timestamp older than this will be discarded.
kafka.max.historical.days=3
I am not able to make out these configurations clearly. Should I set the days property to 1 and the hours property to 2?
How does Camus pull the data? I also often see the following error:
ERROR kafka.CamusJob: Offset range from kafka metadata is outside the previously persisted offset
Please check whether kafka cluster configuration is correct. You can also specify config parameter: kafka.move.to.earliest.offset to start processing from earliest kafka metadata offset.
How do I set the configurations correctly to run every hour and avoid that error?

"Offset range from kafka metadata is outside the previously persisted offset ."
Indicates that your fetching is not as fast as the kafka's pruning.
kafka's pruning is defined by log.retention.hours.
1st option :Increase the retention time by changing "log.retention.hours"
2nd Option :Run it with higher frequency.
3rd Option :Set in your camus job kafka.move.to.earliest.offset=true.
This property will force camus to start consuming from the earliest offset currently present in the kafka. But this may lead to data loss since we are not accounting for the pruned data which we were not able to fetch.
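As a starting point for an hourly job, the relevant properties might look roughly like this (the values below are illustrative assumptions, not tuned recommendations):
# pull at most this many hours of backlog per partition per run
kafka.max.pull.hrs=2
# drop events whose timestamp is older than this many days
kafka.max.historical.days=1
# if the persisted offset has already been pruned by Kafka, restart from
# the earliest available offset (anything pruned in between is lost)
kafka.move.to.earliest.offset=true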

Related

How to use cAdvisor data to calculate network bandwidth usage (per month) in Grafana?

I’m using Prometheus (incl. cAdvisor) and Grafana to monitor my server, on which docker containers are running. cAdvisor gives me the data for my docker containers.
I’m trying to monitor the network bandwidth usage for the selected time (on the top right corner of grafana). It should output a value like 15 GB (for the selected month) or 500 MB (for the selected day).
My approach so far:
In Grafana I am using the Stat UI with Value options > Calculation of Total, while using the following query:
sum(rate(container_network_receive_bytes_total{instance=~"$host",name=~"$container",name=~".+"}[1m]))
(FYI: I have a container variable to filter the values for the selected container. This is why you can find the part ,name=~"$container" in the query above.)
The problem with the approach above is that the output values do not seem right: if I change the time range to a smaller one, I get a bigger value. For instance, if I select Last 2 days the output is 1.19 MB, while selecting Last 24 hours gives me 2.38 MB. Of course, this does not make sense, because yesterday + today can't be smaller than just today.
What am I overlooking? How can I get it to output correct values?
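For comparison, a common pattern for "total bytes over the selected range" is to take a single increase over Grafana's built-in $__range variable rather than summing per-minute rates with a Total calculation; the query below is only an illustrative sketch, not a verified fix for this dashboard:
sum(increase(container_network_receive_bytes_total{instance=~"$host",name=~"$container",name=~".+"}[$__range]))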

How long does a Prometheus timeseries last without an update?

If I send a gauge to Prometheus then the payload has a timestamp and a value like:
metric_name {label="value"} 2.0 16239938546837
If I query it in Prometheus I can see a continuous line. Without sending a payload for the same metric, the line stops. If I send the same metric again after some minutes, I get another continuous line, but it is not connected to the old one.
Is it fixed in Prometheus how long a timeseries lasts without getting an update?
I think the first answer by Marc is in a different context.
Any timeseries in Prometheus goes stale after 5 minutes by default if collection stops - https://www.robustperception.io/staleness-and-promql. In other words, the line stops on the graph (e.g. in Grafana).
So if you resume collecting the metric within 5 minutes, the line will be connected by default. But if there is no collection for more than 5 minutes, the graph will show a gap. You can tweak Grafana to ignore drops, but that is not ideal in some cases, as you do want to see when collection stopped rather than get the false impression that collection was continuous. Alternatively, you can avoid the gap using functions like avg_over_time(metric_name[10m]) as needed.
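For example, applied to the gauge from the question, a query along these lines would smooth over gaps of up to roughly 10 minutes (the window length here is an arbitrary choice):
avg_over_time(metric_name{label="value"}[10m])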
There are two questions here:
1. How long does Prometheus keep the data?
This depends on your storage configuration. By default, on local storage, Prometheus has a retention of 15 days. You can find out more in the documentation. You can also change this value with the --storage.tsdb.retention.time option.
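For example, a 30-day retention could be set at startup roughly like this (the retention value and data path are illustrative):
prometheus --storage.tsdb.retention.time=30d --storage.tsdb.path=/var/lib/prometheus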
2. When will I have a "hole" in my graph?
The line you see on a graph is made by joining the points from each scrape. Those scrapes happen regularly, based on the scrape_interval value in your scrape_config. So basically, if a scrape returns no data, you'll have a hole.
So there is no definitive answer; it essentially depends on your scrape_interval.
Note that if you're using a function that evaluates metrics over a time window, missing one scrape will not alter your graph. For example, using rate(metric[5m]) will not alter your graph if you scrape every 1m (you'll still have 4 other samples to compute the rate).
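For reference, the scrape interval is configured in prometheus.yml, roughly like this (the job name and target are made up for the example):
scrape_configs:
  - job_name: 'example-app'
    scrape_interval: 1m
    static_configs:
      - targets: ['localhost:9100']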

How to define Alerts with exception in InfluxDB/Kapacitor

I'm trying to figure out the best or a reasonable approach to defining alerts in InfluxDB. For example, I might use the CPU batch tickscript that comes with telegraf. This could be setup as a global monitor/alert for all hosts being monitored by telegraf.
What is the approach when you want to deviate from the above setup for a host, i.e. instead of alerting at X% for a specific server we want to alert at Y%?
I'm happy that a distinct tickscript could be created for the custom values, but how do I go about excluding that host from the original 'global' one?
This is a simple scenario, but the approach needs to meet the needs of 10,000 hosts, of which there will be hundreds of exceptions, and it will also encompass tens or hundreds of global alert definitions.
I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.
As said in the comments, you can use the sideload node to achieve that.
Say you want to ensure that your InfluxDB servers are not overloaded. You may want to allow 100 measurements by default, but on one particular server, which happens to receive a massive number of data points, you want to limit it to 10 (a value easily exceeded by the _internal database, but good enough for our example).
Given the following excerpt from a TICKscript
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "numMeasurements")
        .as('value')

var customized = data
    |sideload()
        .source('file:///etc/kapacitor/customizations/demo/')
        .order('hosts/host-{{.hostname}}.yaml')
        .field('maxNumMeasurements', 100)
    |log()

var trigger = customized
    |alert()
        .crit(lambda: "value" > "maxNumMeasurements")
and assuming the server with the exception is named influxdb and the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looks as follows
maxNumMeasurements: 10
a critical alert will be triggered if value (and hence numMeasurements) exceeds 10 and the hostname tag equals influxdb, or if value exceeds 100 on any other host.
There is an example in the documentation that handles scheduled downtimes using sideload.
Furthermore, I have created an example, available on GitHub, using docker-compose.
Note that there is a caveat with the example: the alert flaps because of a second, dynamically generated database. But it should be sufficient to show how to approach the problem.
What is the cost of using sideload nodes in terms of performance and computation if you have over 10 thousand servers?
Managing alerts manually in Chronograf/Kapacitor is not feasible for a large number of custom alerts.
At AMMP Technologies we need to manage alerts per database, customer, and customer_objects; the number can go into the thousands. We've opted for a custom solution where we keep a standard set of template tickscripts (not to be confused with Kapacitor templates) and provide an interface to the user that only exposes the relevant variables. A service (written in Python) then combines the values of those variables with a tickscript and, using the Kapacitor API, deploys (updates, or deletes) the task on the Kapacitor server. This is automated, so that data for new customers/objects is combined with the templates and deployed to Kapacitor automatically.
You obviously need to design your tasks to be specific enough so that they don't overlap and generic enough so that it's not too much work to create tasks for every little thing.
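A very rough sketch of the deployment step described above might look like this (Python with the requests library; the Kapacitor address, database/retention policy, the trivially small template, and the task naming are all assumptions, and the real service is considerably more involved):
import requests

# A deliberately tiny "template tickscript" with per-customer variables to fill in.
TEMPLATE = """
stream
    |from()
        .measurement('cpu')
        .where(lambda: "host" == '{host}')
    |alert()
        .crit(lambda: "usage_idle" < {threshold})
"""

def deploy_task(task_id, host, threshold, kapacitor_url="http://localhost:9092"):
    # Render the template with the values exposed to the user.
    script = TEMPLATE.format(host=host, threshold=threshold)
    body = {
        "id": task_id,
        "type": "stream",
        "dbrps": [{"db": "telegraf", "rp": "autogen"}],
        "script": script,
        "status": "enabled",
    }
    # POST creates the task; a PATCH to /kapacitor/v1/tasks/<id> would update it,
    # and a DELETE would remove it.
    resp = requests.post(f"{kapacitor_url}/kapacitor/v1/tasks", json=body)
    resp.raise_for_status()
    return resp.json()

deploy_task("cpu-alert-server042", host="server042", threshold=10)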

Grafana + Prometheus: Display single stat of how often an event occurred

How do I use Prometheus + Grafana to tell how many times an event occurs
during a given time period?
I have a Prometheus counter that I increment every time this event happens. I would like to display it in a Singlestat number. It seems like this should be as simple as:
sum(increase(some_event_happened{application="example-app"}[$__range]))
And the display set to "Current" value.
However, this gives numbers that are much higher than the actual number of events in the given range. Also, it seems to vary based on how much I offset the range, and how large the range is.
More importantly, it crashes our Prometheus server with an out of memory error when I have three or four of these on a single dashboard.
I've tried setting a recording rule to address the crashes, but I haven't figured out the right way to slice up the recording rule so that I can still display it over the Grafana range.
So in summary, I want a Singlestat displaying the number of times an event happened in the current time range set in the Grafana dashboard. It seems like this is a very basic thing for a monitoring system. Am I just using the wrong approach?
I've encountered similar issues and they appear to be due to discrepancies between the query interval (in Prometheus) and the min step (in Grafana). Try using this global, built-in variable for your interval, which will make sure Prometheus is always in sync with the Grafana step: $__interval.
sum(increase(some_event_happened{application="example-app"}[$__interval]))
http://docs.grafana.org/reference/templating/
https://www.stroppykitten.com/technical/prometheus-grafana-statistics

How to alert in Seyren with Graphite if transactions in last 60 minutes are less than x?

I'm using Graphite + StatsD (with the Python client) to collect custom metrics from a webapp: a counter for successful transactions. Let's say the counter is stats.transactions.count, which also has a rate-per-second metric available at stats.transactions.rate.
I've also set up Seyren as a monitoring and alerting system and successfully pulled metrics from Graphite. Now I want to set up an alert in Seyren if the number of successful transactions in the last 60 minutes is less than a certain minimum.
Which metric and Graphite function should I use? I tried summarize(metric, '1h'), but this gives me an alert each hour when Graphite starts aggregating the metric for the just-started hour.
Note that Seyren also allows you to specify Graphite's from and until parameters, if this helps.
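For context, the counter above might be emitted from the webapp roughly like this (a sketch using the python statsd package; the host, port, and the exact resulting Graphite path depend on your StatsD server configuration):
import statsd

client = statsd.StatsClient('localhost', 8125)

def on_successful_transaction():
    # the StatsD server maps this counter to paths such as
    # stats.transactions.count / stats.transactions.rate in Graphite
    client.incr('transactions')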
I contributed the Seyren code to support from/until in order to handle this exact situation.
The following configuration should raise a warning if the count for the last hour drops below 50, and an error if it drops below 25.
Target: summarize(nonNegativeDerivative(stats.transactions.count),"1h","sum",true)
From: -1h
To: [blank]
Warn: 50 (soft minimum)
Error: 25 (hard minimum)
Note this will run every minute, so the "last hour" is a sliding scale. Also note that the third boolean parameter true for the summarize function is telling it to align its 1h bucket to the From, meaning you get a full 1-hour bucket starting 1 hour ago, rather than accidentally getting a half bucket. (Newer versions of Graphite may do this automatically.)
Your mileage may vary. I have had problems with this approach when the counter gets set back to 0 on a server restart. But in my case I'm using dropwizard metrics + graphite, not statsd + graphite, so you may not have that problem.
Please let me know if this approach works for you!
