How to cluster percentile of events by time delta? - InfluxDB

After a mailing at t0, I will have several "delivered" (and "open" and "click") events. Schema and example:
mailing_name, timestamp, email_id, event_type
niceattack, 2016-07-14 12:11:00, 42, open
niceattack, 2016-07-14 12:11:08, 842, open
niceattack, 2016-07-14 12:11:34, 847, open
I would like to see, for a given mailing, how long it takes to be delivered to half of the recipients. Say I'm sending an email to 1000 addresses now: the first open event arrives after 2 minutes and the last one after a week. The min/max (first/last) seem easy to find, but what I'd like to see is, for example, that half of the recipients opened it within the first 2 hours after it was sent.
The goal is to be able to compare whether sending now versus on Saturday morning makes a difference in how fast the mailing is opened on average, or whether one specific mailing gets quicker exposure, and to correlate that with other events (how many recipients click on a link, take a specific action on our site...).
I tried to use a cumulative function (how many open events per mailing at each point), but it seems that the cumulative function isn't implemented yet: https://github.com/influxdata/influxdb/issues/813
How do you solve that problem with influxdb?

Solving this problem with InfluxDB alone is not currently possible. However, if you're willing to add Kapacitor into the mix, it should be possible. In particular, you'll need to write a User Defined Function (UDF) for that cumulative function in Kapacitor.
The general process will look like the following:
1. Install and configure Kapacitor
2. Create a UDF for the cumulative function you're looking for
3. Enable that UDF inside of Kapacitor
4. Write a TICKscript that uses the UDF and writes the results back to InfluxDB
5. Enable a task defined by the TICKscript you've written
6. Query the InfluxDB instance to get the results of the cumulative function
My apologies for being so high level on this. This is a fairly involved process, but it should give you the result you're looking for.
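To make the target computation concrete, here is a minimal Python sketch of what the cumulative function would ultimately have to answer: the delay after which a given share of recipients has an event. It is not a Kapacitor UDF. The function name `time_to_fraction`, the send time of 12:00:00, and the recipient count are assumptions for illustration, and it assumes at most one event per recipient.

```python
from datetime import datetime, timedelta
import math


def time_to_fraction(sent_at, event_times, recipients, fraction=0.5):
    """Return the delay after which `fraction` of recipients have an event.

    Returns None if that share of recipients was never reached.
    """
    needed = math.ceil(recipients * fraction)  # events required to hit the target
    if needed == 0:
        return timedelta(0)
    ordered = sorted(event_times)
    if len(ordered) < needed:
        return None
    return ordered[needed - 1] - sent_at


# Assumed send time; the open events are the three rows from the question.
sent = datetime(2016, 7, 14, 12, 0, 0)
opens = [
    datetime(2016, 7, 14, 12, 11, 0),
    datetime(2016, 7, 14, 12, 11, 8),
    datetime(2016, 7, 14, 12, 11, 34),
]
# With 4 recipients, half of them (2) had opened by the second event.
delay = time_to_fraction(sent, opens, recipients=4)
print(delay)  # 0:11:08
```

Running this per mailing gives one number per send, which can then be compared across send times.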

Related

Is there a way in grafana to get number of requests at a given time instance?

I have one endpoint for which I would like to see number of requests at a given time (not period). For instance, how many requests were received at 9.30 a.m.
The metric I believe I can make use of is echo_requests_total, but it only accumulates the count. Trying the increase() or rate() functions does not produce the expected output either, unsurprisingly.
I am not even sure whether what I want is possible.
Any help would be appreciated.

How long does a Prometheus time series last without an update

If I send a gauge to Prometheus then the payload has a timestamp and a value like:
metric_name {label="value"} 2.0 16239938546837
If I query it in Prometheus I can see a continuous line. If I stop sending a payload for the same metric, the line stops. If I send the same metric again after some minutes, I get another continuous line, but it is not connected with the old line.
Is it fixed in Prometheus how long a time series lasts without getting an update?
I think the first answer by Marc is in a different context.
Any time series in Prometheus goes stale after 5m by default if the collection stops - https://www.robustperception.io/staleness-and-promql. In other words, the line stops on the graph (or in Grafana).
So if you resume the metrics collection within 5 minutes, the line will be connected by default. But if there is no collection for more than 5 minutes, the graph will show a disconnect. You can tweak Grafana to ignore drops, but that is not ideal in some cases, as you do want to see when the collection stopped instead of getting the false impression that collection was continuous. Alternatively, you can avoid the disconnect by using functions like avg_over_time(metric_name[10m]) as needed.
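As a rough illustration of the 5-minute rule described above, here is a simplified Python model (not Prometheus's actual implementation) that splits sample timestamps into the line segments a graph would draw. The helper name `split_on_gaps` and the sample values are made up for the example.

```python
def split_on_gaps(timestamps, max_gap=300):
    """Group sorted sample timestamps (seconds) into runs with no gap above max_gap."""
    segments = []
    for ts in timestamps:
        if segments and ts - segments[-1][-1] <= max_gap:
            segments[-1].append(ts)  # close enough: same line segment
        else:
            segments.append([ts])  # gap too large: start a new segment
    return segments


# Samples every 60 s, then a 10-minute pause, then two more samples.
samples = [0, 60, 120, 720, 780]
print(split_on_gaps(samples))  # [[0, 60, 120], [720, 780]]
```

The 10-minute pause exceeds the 300-second window, so the series is drawn as two disconnected lines.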
There are two questions here:
1. How long does Prometheus keep the data?
This depends on the configuration you have for your storage. By default, on local storage, Prometheus has a retention of 15 days. You can find out more in the documentation. You can also change this value with the option --storage.tsdb.retention.time
2. When will I have a "hole" in my graph?
The line you see on a graph is made by joining the points from each scrape. Those scrapes are done regularly, based on the scrape_interval value you have in your scrape_config. So basically, if you have no data during one scrape, then you'll have a hole.
So there is no definitive answer; this depends essentially on your scrape_interval.
Note that if you're using a function that evaluates metrics over a certain amount of time, then missing one scrape will not alter your graph. For example, a rate over [5m] will not be affected if you scrape every 1m (as you'll still have 4 other samples to compute the rate).

How to define Alerts with exception in InfluxDB/Kapacitor

I'm trying to figure out the best, or at least a reasonable, approach to defining alerts in InfluxDB. For example, I might use the CPU batch TICKscript that comes with Telegraf. This could be set up as a global monitor/alert for all hosts being monitored by Telegraf.
What is the approach when you want to deviate from the above setup for a host, i.e. for a specific server we want to alert on Y% instead of X%?
I'm happy that a distinct tickscript could be created for the custom values but how do I go about excluding the host from the original 'global' one?
This is a simple scenario but this needs to meet the needs of 10,000 hosts of which there will be 100s of exceptions and this will also encompass 10s/100s of global alert definitions.
I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.
As said in the comments, you can use the sideload node to achieve that.
Say you want to ensure that your InfluxDB servers are not overloaded. You may want to allow 100 measurements by default. Only on one server, which happens to get a massive number of datapoints, you want to limit it to 10 (a value which is exceeded by the _internal database easily, but good for our example).
Given the following excerpt from a TICKscript:

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "numMeasurements")
        .as('value')

var customized = data
    |sideload()
        .source('file:///etc/kapacitor/customizations/demo/')
        .order('hosts/host-{{.hostname}}.yaml')
        .field('maxNumMeasurements', 100)
    |log()

var trigger = customized
    |alert()
        .crit(lambda: "value" > "maxNumMeasurements")

and the name of the server with the exception being influxdb and the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looking as follows:
maxNumMeasurements: 10
A critical alert will be triggered if value (and hence numMeasurements) exceeds 10 on the host whose hostname tag equals influxdb, or if value exceeds 100 on any other host.
There is an example in the documentation that handles scheduled downtimes using sideload.
Furthermore, I have created an example, available on GitHub, using docker-compose.
Note that there is a caveat with the example: the alert flaps because a second database is generated dynamically. But it should be sufficient to show how to approach the problem.
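The lookup that the sideload node performs here (per-host value if a file defines one, otherwise the default given to .field()) can be paraphrased in a few lines of Python. This is only a sketch of the logic, not Kapacitor code: the `overrides` dict is a hypothetical stand-in for the YAML files, and `is_critical` is a made-up helper.

```python
DEFAULT_MAX = 100  # mirrors .field('maxNumMeasurements', 100)

# Hypothetical stand-in for the per-host YAML files, keyed by hostname.
overrides = {"influxdb": {"maxNumMeasurements": 10}}


def is_critical(hostname, value, overrides, default=DEFAULT_MAX):
    """Apply the host's own limit if one is defined, else the default."""
    limit = overrides.get(hostname, {}).get("maxNumMeasurements", default)
    return value > limit


print(is_critical("influxdb", 50, overrides))    # True: 50 > 10
print(is_critical("other-host", 50, overrides))  # False: 50 <= 100
print(is_critical("other-host", 150, overrides)) # True: 150 > 100
```

One global task therefore covers every host, and an exception only requires adding a file for that host.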
What is the cost of using sideload nodes in terms of performance and computation if you have over 10 thousand servers?
Managing alerts manually, directly in Chronograf/Kapacitor, is not feasible for a large number of custom alerts.
At AMMP Technologies we need to manage alerts per database, customer, and customer_objects. The number can go into the 1000s. We've opted for a custom solution where we keep a standard set of template TICKscripts (not to be confused with Kapacitor templates), and we provide an interface to the user where we only expose the relevant variables. After that, a service (written in Python) combines the values for those variables with a TICKscript and, using the Kapacitor API, deploys (updates, or deletes) the task on the Kapacitor server. This is then automated so that data for new customers/objects is combined with the templates and automatically deployed to Kapacitor.
You obviously need to design your tasks to be specific enough so that they don't overlap and generic enough so that it's not too much work to create tasks for every little thing.
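A minimal sketch of the "combine variables with a template" step might look like the following. It is not our actual service: the template text, the variable names `host` and `threshold`, and the helper `build_task` are all hypothetical. The payload shape (id, type, dbrps, script, status) follows my understanding of Kapacitor's HTTP task API and should be checked against the Kapacitor documentation before use. Sending the request is deliberately left out.

```python
import json
from string import Template

# Hypothetical template; $host and $threshold are the only exposed variables.
CPU_TEMPLATE = Template("""\
stream
    |from()
        .measurement('cpu')
        .where(lambda: "host" == '$host')
    |alert()
        .crit(lambda: "usage_idle" < $threshold)
""")


def build_task(task_id, template, variables, database, retention_policy="autogen"):
    """Render a template and wrap it in a task definition for deployment."""
    return {
        "id": task_id,
        "type": "stream",
        "dbrps": [{"db": database, "rp": retention_policy}],
        "script": template.substitute(variables),  # raises KeyError if a variable is missing
        "status": "enabled",
    }


task = build_task("cpu_web01", CPU_TEMPLATE, {"host": "web01", "threshold": 10}, "telegraf")
print(json.dumps(task, indent=2))
```

Because `substitute` fails on a missing variable, an incomplete set of user inputs is caught before anything is deployed.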

Grafana: Panel with time of last result

I have an elasticsearch instance that receives logs from multiple backup routines. I'd like to query ES for these logs from Grafana and set up a panel that shows the last time for the different backups. Ideally I would also like to be able to show this in color if the time is longer than a certain threshold.
Basically the idea is to have a display that shows, for instance, green if a certain backup has been completed in the last 24 hours, and red if it hasn't.
How would I do this in Grafana with ES as the datasource?
The exact implementation depends on the panel used.
Example for singlestat: write the ES query and then select Stat: Time of last point; you may need to select a suitable unit/format.
Unfortunately, Grafana doesn't understand thresholds in your requested time format (older than 24 hours). You will need to return it as a metric (for example, the age of the last backup in seconds), which means you will need to write a query for that. It also means that you will have 2 stats to show (last time + age), so you won't be able to use singlestat. A table panel will probably be better: you can use thresholding based on the age metric there.
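To illustrate the "age in seconds" metric and the 24-hour threshold, here is a small Python sketch of the arithmetic only. It is not a Grafana or Elasticsearch query; the function name `backup_status` and the timestamps are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def backup_status(last_backup, now, max_age=timedelta(hours=24)):
    """Return (age_in_seconds, colour) for the most recent backup log entry."""
    age = (now - last_backup).total_seconds()
    colour = "green" if age <= max_age.total_seconds() else "red"
    return age, colour


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
# Last backup 6 hours ago: within the threshold.
print(backup_status(datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc), now))    # (21600.0, 'green')
# Last backup 54 hours ago: over the threshold.
print(backup_status(datetime(2023, 12, 31, 6, 0, tzinfo=timezone.utc), now))  # (194400.0, 'red')
```

The age value is what the threshold would be applied to in the panel, while the timestamp itself is shown for reference.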
In addition to the great answer by Jan Garaj, it looks like there is work being done to make this type of thing much easier in the future.
Check out this issue to check progress.

In InfluxDB/Telegraf How to compute difference between 2 fields based on 3rd field

I have the current use case:
We have a system that computes different response time metrics for messages that we want to insert in InfluxDB. This system writes JSON entries to a file.
We use telegraf with JSON plugin to extract the fields we want and insert into InfluxDB.
So far so good.
But we have an issue with 1 particular information.
The system will emit messages where mId is the Unique identifier, in the below examples we have 2 uuidXXXX and uuidYYYY:
{"meta1":"value", "mId":"uuidXXXX", "resTime1":1232332233, "timeWeEnterBus":startTimestamp}
{"meta1":"value2", "mId":"uuidYYYY", "resTime1":1232331111, "timeWeEnterBus":startTimestamp}
{"meta1":"value", "mId":"uuidXXXX", "resTime1":1232332233, "timeWeExitBus":endTimestamp}
{"meta1":"value2", "mId":"uuidYYYY", "resTime1":1232331111, "timeWeEnterBus":startTimestamp}
What we want here is to graph timeInBus, which is equal to timeWeExitBus - timeWeEnterBus for each unique mId.
So my questions are:
1. In my understanding (IMU), uuid would be a field, not a tag, as its cardinality is unbounded. The same goes for timeWeExitBus and timeWeEnterBus, which would be numeric fields since we want to use functions on them. And timeInBus would be the measurement. Am I right?
2. Is this a good use case for Influx/Telegraf, or are we misusing it? IMU, computing this on the Telegraf side doesn't look like a good fit, but I don't see how to do it in InfluxDB. I initially thought the ELAPSED function could help, but I have come to think it doesn't work here.
3. If it's a good use case, could you point me to documentation that helps with implementing this?
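For reference, the computation we are after can be sketched in Python as a pairing of enter and exit events by mId. This is only meant to pin down the intended result, not a Telegraf or InfluxDB solution; the function name `time_in_bus` and the numeric timestamps are made up for the example.

```python
def time_in_bus(events):
    """Pair enter/exit events by mId and return {mId: exit - enter}."""
    entered = {}    # mId -> enter timestamp, waiting for its exit event
    durations = {}  # mId -> time spent in the bus
    for event in events:
        m_id = event["mId"]
        if "timeWeEnterBus" in event:
            entered[m_id] = event["timeWeEnterBus"]
        elif "timeWeExitBus" in event and m_id in entered:
            durations[m_id] = event["timeWeExitBus"] - entered.pop(m_id)
    return durations


events = [
    {"mId": "uuidXXXX", "timeWeEnterBus": 1000},
    {"mId": "uuidYYYY", "timeWeEnterBus": 1005},
    {"mId": "uuidXXXX", "timeWeExitBus": 1042},
    {"mId": "uuidYYYY", "timeWeExitBus": 1050},
]
print(time_in_bus(events))  # {'uuidXXXX': 42, 'uuidYYYY': 45}
```

Exit events without a matching enter event are ignored in this sketch.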

Resources