Graphite has null values between data points - docker

I have an API that fetches data packets from different servers and formats this data into small JSON units. I wrote a script that sends them to Graphite with the json2graphite command.
The sending works very well, the incoming data doesn't look bad either.
Now the problem:
The data displayed in graphite shows that each entry is followed by a null.
The data points should be connected.
I am aware that this data can also be connected using a function provided by the Graphite interface, but this doesn't help because Grafana boards always jump back and forth between value and null.
Is there a way to tell Grafana that it only goes to null if there was no data for more than 1 min or so?
I already tried to fix the problem with the data from "storage-schemas.conf" and "storage-aggregation.conf". Unfortunately without success.
storage-schemas.conf:
[default_1min_for_1day]
pattern = .*
retentions = 10s:6h,30s:8d,1m:31d,10m:1y,1h:5y
storage-aggregation.conf:
[default_average]
pattern = .*
xFilesFactor = 0
aggregationMethod = average
If you want to know any more, ask me. : )

Grafana has an option to connect data points that are separated by nulls. It is described under the Display Styles settings in Grafana's documentation.
In Graphite composer you can do the same by selecting the connected line mode under Graph options.
Additionally, you could use Graphite's keepLastValue function to carry the last received value over gaps where there are nulls.
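To illustrate what keepLastValue does when it is given a limit, here is a minimal Python sketch. It is not Graphite's own code; the function name keep_last_value and the sample data are made up for illustration. The idea is that a run of nulls is filled with the last known value only if the run is no longer than the limit, so longer outages still show up as gaps.

```python
from typing import List, Optional


def keep_last_value(points: List[Optional[float]], limit: int) -> List[Optional[float]]:
    """Fill runs of None that are at most `limit` long with the previous value."""
    filled = list(points)
    gap_start = None  # index of the first None in the current run

    def fill(start: int, end: int) -> None:
        # Only fill when there is a previous value and the run is short enough.
        if start > 0 and (end - start) <= limit:
            for j in range(start, end):
                filled[j] = points[start - 1]

    for i, value in enumerate(points):
        if value is None:
            if gap_start is None:
                gap_start = i
            continue
        if gap_start is not None:
            fill(gap_start, i)
            gap_start = None

    # A run of None at the very end is handled the same way.
    if gap_start is not None:
        fill(gap_start, len(points))
    return filled


series = [1.0, None, 2.0, None, None, None, 5.0]
print(keep_last_value(series, limit=2))
```

With a 10s resolution, a Graphite target such as keepLastValue(my.metric, 6) (my.metric is a placeholder) would bridge gaps of up to one minute and leave longer gaps as nulls, which is close to what the question asks for.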

I haven't found a direct solution but I will now try to minimize the interval between the entries. I noticed that the requests take much too long: 2-5 minutes.
There are probably too many servers, so the requests block the port too long.
The problem is not solved yet, but I will mark it as solved if nobody reports that the problem persists within 5 days.

Related

How long Prometheus timeseries last without an update

If I send a gauge to Prometheus then the payload has a timestamp and a value like:
metric_name {label="value"} 2.0 16239938546837
If I query it on Prometheus I can see a continuous line. If I stop sending a payload for the same metric, the line stops. If I send the same metric again after some minutes, I get another continuous line, but it is not connected with the old line.
Is it fixed in Prometheus how long a time series lasts without getting an update?
I think the first answer by Marc is in a different context.
Any time series in Prometheus goes stale after 5 minutes by default if collection stops (see https://www.robustperception.io/staleness-and-promql). In other words, the line stops on the graph (in Prometheus or Grafana).
So if you resume collecting the metric within 5 minutes, the line is connected by default. If there is no collection for more than 5 minutes, the graph shows a gap. You can configure Grafana to ignore such drops, but that is not ideal in some cases: you usually want to see when collection stopped rather than get the false impression that it was continuous. Alternatively, you can avoid the gap with functions such as avg_over_time(metric_name[10m]) as needed.
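The 5 minute rule can be sketched in Python as follows. This is a simplified model of how a value is picked at each evaluation time, assuming the default 5 minute lookback and ignoring explicit staleness markers; value_at and the sample data are hypothetical and not part of Prometheus.

```python
from bisect import bisect_right

LOOKBACK_SECONDS = 300  # assumed default lookback of 5 minutes


def value_at(samples, eval_time, lookback=LOOKBACK_SECONDS):
    """Return the newest sample value at or before eval_time, or None if it is too old.

    `samples` is a list of (timestamp_seconds, value) sorted by timestamp.
    """
    timestamps = [t for t, _ in samples]
    idx = bisect_right(timestamps, eval_time) - 1
    if idx < 0:
        return None  # no sample exists yet at this time
    ts, value = samples[idx]
    if eval_time - ts > lookback:
        return None  # the last sample is stale, so the graph shows a gap
    return value


samples = [(0, 2.0), (60, 2.0)]
print(value_at(samples, 120))  # last sample is 60s old, still used
print(value_at(samples, 400))  # last sample is 340s old, treated as a gap
```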
There are two questions here:
1. How long does Prometheus keep the data?
This depends on how your storage is configured. By default, on local storage, Prometheus has a retention of 15 days. You can find out more in the documentation. You can also change this value with the option --storage.tsdb.retention.time
2. When will I have a "hole" in my graph?
The line you see on a graph is made by joining the points from each scrape. Those scrapes are done regularly, based on the scrape_interval value in your scrape_config. So basically, if you have no data during one scrape, then you'll have a hole.
So there is no definitive answer; it depends essentially on your scrape_interval.
Note that if you're using a function that evaluates metrics over a certain range of time, then missing one scrape will not break your graph. For example, with rate(...[5m]) and a scrape every 1m, one missed scrape will not break the graph, as there are still 4 other samples in the window to compute the rate from.
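As a rough illustration of why a range function tolerates a missed scrape, here is a small Python sketch. It is a deliberately simplified rate (last minus first sample divided by the elapsed time); the real rate() also extrapolates to the window edges and handles counter resets, which this sketch does not. simple_rate and the sample counter values are made up.

```python
def simple_rate(samples, window_start, window_end):
    """Per-second increase between the first and last sample in the window.

    Simplified on purpose: no extrapolation and no counter-reset handling.
    """
    in_window = [(t, v) for t, v in samples if window_start <= t <= window_end]
    if len(in_window) < 2:
        return None  # not enough samples to compute a rate
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


# A counter scraped every 60s that grows by 1 per second.
full = [(0, 0.0), (60, 60.0), (120, 120.0), (180, 180.0), (240, 240.0)]
# The same series with the scrape at t=120 missing.
missed = [sample for sample in full if sample[0] != 120]

print(simple_rate(full, 0, 300))
print(simple_rate(missed, 0, 300))
```

Both calls give the same result, because the remaining samples in the 5 minute window are enough to compute the rate.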

How to define Alerts with exception in InfluxDB/Kapacitor

I'm trying to figure out the best or a reasonable approach to defining alerts in InfluxDB. For example, I might use the CPU batch tickscript that comes with telegraf. This could be set up as a global monitor/alert for all hosts being monitored by telegraf.
What is the approach when you want to deviate from the above setup for a host, i.e. instead of X% we want to alert on Y% for a specific server?
I'm happy that a distinct tickscript could be created for the custom values but how do I go about excluding the host from the original 'global' one?
This is a simple scenario but this needs to meet the needs of 10,000 hosts of which there will be 100s of exceptions and this will also encompass 10s/100s of global alert definitions.
I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.
As said in the comments, you can use the sideload node to achieve that.
Say you want to ensure that your InfluxDB servers are not overloaded. You may want to allow 100 measurements by default. Only on one server, which happens to get a massive number of datapoints, you want to limit it to 10 (a value which is exceeded by the _internal database easily, but good for our example).
Given the following excerpt from a tick script
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "numMeasurements")
        .as('value')
var customized = data
    |sideload()
        .source('file:///etc/kapacitor/customizations/demo/')
        .order('hosts/host-{{.hostname}}.yaml')
        .field('maxNumMeasurements', 100)
    |log()
var trigger = customized
    |alert()
        .crit(lambda: "value" > "maxNumMeasurements")
and the name of the server with the exception being influxdb and the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looking as follows
maxNumMeasurements: 10
A critical alert is triggered if value (and hence numMeasurements) exceeds 10 and the hostname tag equals influxdb, or if value exceeds 100 on any other host.
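The lookup logic that sideload provides here can be mimicked in a few lines of Python. This is only a sketch of the idea (per-host override with a global default), not how Kapacitor implements it; the overrides dictionary stands in for the YAML files, and threshold_for and is_critical are hypothetical names.

```python
DEFAULT_MAX = 100  # same default as .field('maxNumMeasurements', 100)

# Stand-in for the per-host YAML files; only hosts with exceptions are listed.
overrides = {"influxdb": {"maxNumMeasurements": 10}}


def threshold_for(hostname, overrides, default=DEFAULT_MAX):
    """Return the host-specific threshold, or the default if none is defined."""
    return overrides.get(hostname, {}).get("maxNumMeasurements", default)


def is_critical(hostname, value, overrides):
    """Mirror of the crit condition: value > maxNumMeasurements."""
    return value > threshold_for(hostname, overrides)


print(is_critical("influxdb", 11, overrides))   # exceeds the override of 10
print(is_critical("other-host", 11, overrides))  # below the default of 100
```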
There is an example in the documentation handling scheduled downtimes using sideload
Furthermore, I have created an example available on github using docker-compose
Note that there is a caveat with the example: The alert flaps because of a second database dynamically generated. But it should be sufficient to show how to approach the problem.
What is the cost of using sideload nodes in terms of performance and computation if you have over 10 thousand servers?
Managing alerts manually, directly in Chronograf/Kapacitor, is not feasible for a large number of custom alerts.
At AMMP Technologies we need to manage alerts per database, customer, and customer_objects. The number can go into the 1000s. We've opted for a custom solution where we keep a standard set of template tickscripts (not to be confused with Kapacitor templates) and provide an interface to the user where we expose only the relevant variables. A service (written in Python) then combines the values for those variables with a tickscript and uses the Kapacitor API to deploy (update, or delete) the task on the Kapacitor server. This is automated, so that data for new customers/objects is combined with the templates and automatically deployed to Kapacitor.
You obviously need to design your tasks to be specific enough so that they don't overlap and generic enough so that it's not too much work to create tasks for every little thing.
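A minimal sketch of such a templating step in Python could look like this. It is not the service described above: the template, the variable names (host, crit_idle), the task id scheme and the database/retention policy defaults are all assumptions for illustration. The sketch only builds the task document; sending it to Kapacitor (to my knowledge via a POST to the /kapacitor/v1/tasks endpoint, but check the API documentation) is left out.

```python
import json
from string import Template

# Hypothetical template tickscript; $host and $crit_idle are filled in per task.
CPU_TEMPLATE = Template("""\
stream
    |from()
        .measurement('cpu')
        .where(lambda: "host" == '$host')
    |alert()
        .crit(lambda: "usage_idle" < $crit_idle)
""")


def build_task(host, crit_idle, db="telegraf", rp="autogen"):
    """Combine user-supplied values with the template into a task document."""
    script = CPU_TEMPLATE.substitute(host=host, crit_idle=crit_idle)
    return {
        "id": f"cpu_alert_{host}",          # assumed naming scheme
        "type": "stream",
        "dbrps": [{"db": db, "rp": rp}],    # assumed database and retention policy
        "script": script,
        "status": "enabled",
    }


task = build_task("web-01", 10)
print(json.dumps(task, indent=2))
```

Keeping the id derived from the host makes it straightforward to update or delete the matching task later, which is one way to keep tasks from overlapping.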

In InfluxDB/Telegraf How to compute difference between 2 fields based on 3rd field

I have the current use case:
We have a system that computes different response time metrics for messages that we want to insert in InfluxDB. This system writes JSON entries to a file.
We use telegraf with JSON plugin to extract the fields we want and insert into InfluxDB.
So far so good.
But we have an issue with 1 particular information.
The system will emit messages where mId is the Unique identifier, in the below examples we have 2 uuidXXXX and uuidYYYY:
{"meta1":"value", "mId":"uuidXXXX", "resTime1":1232332233, "timeWeEnterBus":startTimestamp}
{"meta1":"value2", "mId":"uuidYYYY", "resTime1":1232331111, "timeWeEnterBus":startTimestamp}
{"meta1":"value", "mId":"uuidXXXX", "resTime1":1232332233, "timeWeExitBus":endTimestamp}
{"meta1":"value2", "mId":"uuidYYYY", "resTime1":1232331111, "timeWeEnterBus":startTimestamp}
And what we want here is to graph the timeInBus which is equal to “timeWeExitBus-timeWeEnterBus” for each unique mId.
So my questions are:
IMU, uuid would be a field, not a tag, as its set of values is unbounded. The same goes for timeWeExitBus and timeWeEnterBus, which would be numeric fields since we want to use functions on them. And timeInBus would be the measurement. Am I right?
Is this a good use case for Influx / Telegraf, or are we misusing it? IMU, computing this on the Telegraf side does not look like a good fit, but I don’t see how to do it in InfluxDB. I initially thought the ELAPSED function could help, but I now think it doesn’t work here.
If it’s a good use case, could you point me to documentation that helps with implementing this?
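For reference, the timeInBus calculation described above can be sketched in Python as a pre-processing step that runs before the data reaches Telegraf. This is only an illustration of the pairing logic, assuming numeric timestamps and that each mId has one enter event followed by one exit event; time_in_bus and the sample values are made up.

```python
import json


def time_in_bus(lines):
    """Pair enter/exit events by mId and return {mId: exit - enter}."""
    entered = {}    # mId -> enter timestamp, waiting for the exit event
    durations = {}  # mId -> time spent in the bus
    for line in lines:
        event = json.loads(line)
        m_id = event["mId"]
        if "timeWeEnterBus" in event:
            entered[m_id] = event["timeWeEnterBus"]
        elif "timeWeExitBus" in event and m_id in entered:
            durations[m_id] = event["timeWeExitBus"] - entered.pop(m_id)
    return durations


sample = [
    '{"meta1": "value", "mId": "uuidXXXX", "timeWeEnterBus": 1000}',
    '{"meta1": "value2", "mId": "uuidYYYY", "timeWeEnterBus": 1005}',
    '{"meta1": "value", "mId": "uuidXXXX", "timeWeExitBus": 1042}',
    '{"meta1": "value2", "mId": "uuidYYYY", "timeWeExitBus": 1100}',
]
print(time_in_bus(sample))
```

The resulting value per mId could then be written as a single numeric field, which avoids having to join two points inside InfluxDB.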

Measure service latency with Prometheus

I am new to Prometheus and Grafana. My primary goal is to get the response time per request.
For me it seemed to be a simple thing - but whatever I do I do not get the results I require.
I need to be able to analyse the service latency in the last minutes/hours/days. The current implementation I found was a simple SUMMARY (without definition of quantiles) which is scraped every 15s.
Is it possible to get the average request latency of the last minute from my Prometheus SUMMARY?
If YES: How? If NO: What should I do?
Currently I am using the following query:
rate(http_response_time_sum{application="myapp",handler="myHandler", status="200"}[1m])
/
rate(http_response_time_count{application="myapp",handler="myHandler", status="200"}[1m])
I am getting two "datasets". The value of the first is "NaN". I suppose this is the result from a division by zero.
(I am using spring-client).
Your query is correct. The result will be NaN if the handler received no requests in the past minute, because both rates are then 0 and 0/0 evaluates to NaN.
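The arithmetic behind the NaN can be reproduced in Python. PromQL uses floating point division, where 0/0 is NaN; plain Python raises ZeroDivisionError instead, so this sketch handles that case explicitly. average_latency and the sample numbers are hypothetical.

```python
import math


def average_latency(sum_rate, count_rate):
    """Average latency as sum_rate / count_rate, with float division semantics."""
    if count_rate == 0:
        # No requests in the window: 0/0 is NaN, x/0 is +/- infinity.
        return math.nan if sum_rate == 0 else math.copysign(math.inf, sum_rate)
    return sum_rate / count_rate


print(average_latency(1.5, 3.0))  # 1.5s of latency over 3 requests per second
print(average_latency(0.0, 0.0))  # no requests in the last minute
```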

National Weather Service (NOAA) REST API returns nil for parameters of forecast

I am using the NWS REST API as my weather service for an app I am making. I was initially reluctant to use NWS because of its bad documentation, but I couldn't resist as it is offered completely free.
Now that I am trying to use it, I am running into some difficulty. When making a request for multiple days, the minimum temperature appears nil for several days.
(EDIT: As I have been testing the API more I have found that it is not always the minimum temperatures that are nil. It can be a max temp or a precipitation, it seems completely random. If you would like to make test calls using their web interface, you can do so here: http://graphical.weather.gov/xml/sample_products/browser_interface/ndfdBrowserByDay.htm
and here: http://graphical.weather.gov/xml/sample_products/browser_interface/ndfdXML.htm)
Here is an example of a request where the minimum temperatures are empty: http://graphical.weather.gov/xml/sample_products/browser_interface/ndfdBrowserClientByDay.php?listLatLon=40.863235,-73.714780&format=24%20hourly&numDays=7
Surprisingly, on their website, the minimum temperatures are available:
http://forecast.weather.gov/MapClick.php?textField1=40.83&textField2=-73.70
You'll see under the Minimum temperatures that it is filled with about 5 (sometimes less, it is inconsistent) blank fields that say <value xsi:nil="true"/>
If anybody can help me it would be greatly appreciated, using the NWS API can be a little overwhelming at times.
Thanks,
The nil values, from what I can understand of the documentation, here and here, simply indicate that the data is unavailable.
Without making assumptions about NOAA's data architecture, it's conceivable that the information available via the API may differ from what their website displays.
Missing values are represented by an empty element and xsi:nil="true" (R2.2.1).
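When parsing the response, such elements can be mapped to a "no data" marker instead of being treated as errors. The following Python sketch uses a small, made-up XML fragment that only imitates the shape of the real response; read_values is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xsi prefix to the full namespace URI.
XSI_NIL = "{http://www.w3.org/2001/XMLSchema-instance}nil"

# Simplified, invented sample; the real document has more structure around it.
SAMPLE = """\
<temperature xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" type="minimum">
  <value>41</value>
  <value xsi:nil="true"/>
  <value>39</value>
</temperature>
"""


def read_values(xml_text):
    """Return the values as ints, with None for elements marked xsi:nil."""
    root = ET.fromstring(xml_text)
    values = []
    for node in root.iter("value"):
        if node.get(XSI_NIL) == "true":
            values.append(None)  # data unavailable for this period
        else:
            values.append(int(node.text))
    return values


print(read_values(SAMPLE))
```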
Whether nil values are returned seems to depend on the time period. Notice the difference between the time-layout keys (see section 5.3.2) in these requests:
k-p24h-n7-1
k-p24h-n6-1
The data times are different.
<layout-key> element
The key is derived using the following convention:
“k” stands for key.
“p24h” implies a data period length of 24 hours.
“n7” means that the number of data times is 7.
“1” is a sequential number used to keep the layout keys unique.
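The key convention above is regular enough to parse mechanically. A minimal Python sketch, assuming keys always follow the k-p<period><unit>-n<count>-<sequence> shape shown here (parse_layout_key is a hypothetical helper, not part of any NWS library):

```python
import re

# k - p<period length><unit> - n<number of data times> - <sequence number>
LAYOUT_KEY = re.compile(r"^k-p(\d+)([a-z]+)-n(\d+)-(\d+)$")


def parse_layout_key(key):
    """Split a layout key such as 'k-p24h-n7-1' into its parts."""
    match = LAYOUT_KEY.match(key)
    if match is None:
        raise ValueError(f"unexpected layout key: {key!r}")
    period, unit, count, sequence = match.groups()
    return {
        "period": int(period),      # e.g. 24
        "unit": unit,               # e.g. "h" for hours
        "data_times": int(count),   # e.g. 7
        "sequence": int(sequence),  # keeps keys unique
    }


print(parse_layout_key("k-p24h-n7-1"))
print(parse_layout_key("k-p24h-n6-1"))
```

Comparing the data_times value of two responses makes it easy to spot when one request covers fewer periods than another.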
Here, startDate is the factor. Leaving it off includes more time and might account for some requested data not yet being available.
Per documentation:
The beginning day for which you want NDFD data. If the string is empty, the start date is assumed to be the earliest available day in the database. This input is only needed if one wants to shorten the time window data is to be retrieved for (less than entire 7 days worth), e.g. if user wants data for days 2-5.
I'm not experiencing the randomness you mention. The folks on NOAA's Yahoo! Groups forum might be able to tell you more.
