How can I get envoyproxy/ratelimit statistics for descriptors without value?

I am using envoyproxy/ratelimit (along with Istio) to set up global rate limiting in my k8s cluster for a given service. The rate limit is based on a header (in my case the username), so that each username is limited to a given number of RPS. The following configuration was used to achieve this:
domain: ratelimit
descriptors:
  - key: USERNAME
    rate_limit:
      unit: second
      requests_per_unit: 100
    shadow_mode: true
Also, I used an EnvoyFilter (Istio CRD) to define which header will be used.
The resulting metric does not show a label for a specific user, just for the entire descriptor:
ratelimit_service_rate_limit_within_limit{app="ratelimit",domain="ratelimit",instance="xxx",job="kubernetes-pods",key1="USERNAME",kubernetes_namespace="xxx",kubernetes_pod_name="ratelimit-xxx",pod_template_hash="xxx",security_istio_io_tlsMode="istio",service_istio_io_canonical_name="ratelimit",service_istio_io_canonical_revision="latest"}
So my question is: how can I get the metrics for a specific username, considering that my configuration applies to all of them and not to a specific value?

Thanks to this PR you can now add a detailed_metric parameter to enable this behavior, as shown in this example.

Related

How long do Prometheus time series last without an update

If I send a gauge to Prometheus then the payload has a timestamp and a value like:
metric_name {label="value"} 2.0 16239938546837
If I query it on Prometheus I can see a continuous line. Without sending a payload for the same metric, the line stops. If I send the same metric again after some minutes, I get another continuous line, but it is not connected to the old one.
Is there a fixed duration in Prometheus for how long a time series lasts without getting an update?
I think the first answer by Marc is in a different context.
Any time series in Prometheus goes stale after 5m by default if the collection stops - https://www.robustperception.io/staleness-and-promql. In other words, the line stops on the graph (or in Grafana).
So if you resume the metrics collection within 5 minutes, the line will be connected by default. But if there is no collection for more than 5 minutes, the graph will show a gap. You can tweak Grafana to ignore drops, but that is not ideal in some cases, as you do want to see when the collection stopped instead of getting the false impression that collection was continuous. Alternatively, you can avoid the gap using functions like avg_over_time(metric_name[10m]) as needed.
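The 5-minute behaviour described above can be illustrated with a small model. This is only a sketch in Python, not Prometheus code: the function name value_at and the sample data are made up for illustration, it assumes the default 5-minute lookback, and it ignores staleness markers.

```python
LOOKBACK_SECONDS = 300  # assumed default lookback of 5 minutes


def value_at(samples, t, lookback=LOOKBACK_SECONDS):
    """Return the newest sample value in the window (t - lookback, t].

    samples is a list of (timestamp_seconds, value) pairs.
    None means the graph would show a gap at time t.
    """
    candidates = [(ts, v) for ts, v in samples if t - lookback < ts <= t]
    if not candidates:
        return None
    # Pick the sample with the largest timestamp.
    return max(candidates, key=lambda sample: sample[0])[1]


# Two samples one minute apart, then nothing until t=1000.
samples = [(0, 1.0), (60, 2.0), (1000, 3.0)]
for t in (200, 400, 1000):
    print(t, value_at(samples, t))
```

In this model the value at t=200 is still served by the sample from t=60, while t=400 falls more than 5 minutes after the last sample and yields a gap.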
There are two questions here:
1. How long does Prometheus keep the data?
This depends on the configuration you have for your storage. By default, on local storage, Prometheus has a retention of 15 days. You can find out more in the documentation. You can also change this value with the option --storage.tsdb.retention.time.
2. When will I have a "hole" in my graph?
The line you see on a graph is made by joining each point from each scrape. Those scrapes are done regularly, based on the scrape_interval value you have in your scrape_config. So basically, if you have no data during one scrape, then you'll have a hole.
So there is no definitive answer; this depends essentially on your scrape_interval.
Note that if you're using a function that evaluates metrics over a certain amount of time, then missing one scrape will not leave a hole in your graph. For example, using rate over a [5m] range will not leave a hole if you scrape every 1m (as you'll still have 4 other samples to compute the rate).
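To see why a range function tolerates a missed scrape, here is a simplified Python model of a per-second rate over a window. It is a sketch under stated assumptions: simple_rate is a made-up name, and unlike the real rate function it does no extrapolation and does not handle counter resets.

```python
def simple_rate(samples, t, window):
    """Per-second increase between the first and last sample in (t - window, t].

    samples is a list of (timestamp_seconds, counter_value) pairs.
    Returns None when fewer than two samples fall inside the window.
    """
    in_window = sorted((ts, v) for ts, v in samples if t - window < ts <= t)
    if len(in_window) < 2:
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


# Counter scraped every 60s, with the scrape at t=180 missing.
scrapes = [(0, 0), (60, 60), (120, 120), (240, 240), (300, 300)]
print(simple_rate(scrapes, 300, 300))
```

With one scrape missing, the window still contains enough samples to produce a value, so the graph has no hole at that point.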

How to define Alerts with exception in InfluxDB/Kapacitor

I'm trying to figure out the best or a reasonable approach to defining alerts in InfluxDB. For example, I might use the CPU batch tickscript that comes with telegraf. This could be set up as a global monitor/alert for all hosts being monitored by telegraf.
What is the approach when you want to deviate from the above setup for a host, i.e., instead of X% we want to alert on Y% for a specific server?
I'm happy that a distinct tickscript could be created for the custom values but how do I go about excluding the host from the original 'global' one?
This is a simple scenario but this needs to meet the needs of 10,000 hosts of which there will be 100s of exceptions and this will also encompass 10s/100s of global alert definitions.
I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.
As said in the comments, you can use the sideload node to achieve that.
Say you want to ensure that your InfluxDB servers are not overloaded. You may want to allow 100 measurements by default. Only on one server, which happens to get a massive number of datapoints, you want to limit it to 10 (a value which is exceeded by the _internal database easily, but good for our example).
Given the following excerpt from a tick script:
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "numMeasurements")
        .as('value')

var customized = data
    |sideload()
        .source('file:///etc/kapacitor/customizations/demo/')
        .order('hosts/host-{{.hostname}}.yaml')
        .field('maxNumMeasurements', 100)
    |log()

var trigger = customized
    |alert()
        .crit(lambda: "value" > "maxNumMeasurements")
and the name of the server with the exception being influxdb, and the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looking as follows:
maxNumMeasurements: 10
A critical alert will be triggered if value (and hence numMeasurements) exceeds 10 and the hostname tag equals influxdb, or if value exceeds 100.
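The lookup that sideload performs here (a per-host value if a file exists, otherwise the default) can be sketched in Python. This is an illustration of the logic only, not Kapacitor code: the overrides dictionary stands in for the host-*.yaml files, and is_critical is a made-up name.

```python
DEFAULT_MAX_NUM_MEASUREMENTS = 100  # the default passed to .field() above

# Stand-in for the sideloaded files: one entry per host with an exception.
overrides = {
    "influxdb": {"maxNumMeasurements": 10},
}


def is_critical(hostname, value, overrides, default=DEFAULT_MAX_NUM_MEASUREMENTS):
    """Return True when value exceeds the host's threshold (override or default)."""
    threshold = overrides.get(hostname, {}).get("maxNumMeasurements", default)
    return value > threshold


print(is_critical("influxdb", 11, overrides))   # host with an exception
print(is_critical("otherhost", 11, overrides))  # host using the default
```

Only hosts with an exception need an entry, so the global alert definition stays the same for everyone else.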
There is an example in the documentation that handles scheduled downtimes using sideload.
Furthermore, I have created an example, available on GitHub, that uses docker-compose.
Note that there is a caveat with the example: The alert flaps because of a second database dynamically generated. But it should be sufficient to show how to approach the problem.
What is the cost of using sideload nodes in terms of performance and computation if you have over 10 thousand servers?
Managing alerts manually, directly in Chronograf/Kapacitor, is not feasible for a large number of custom alerts.
At AMMP Technologies we need to manage alerts per database, customer, and customer_objects. The number can go into the 1000s. We've opted for a custom solution where we keep a standard set of template tickscripts (not to be confused with Kapacitor templates), and we provide an interface to the user where we only expose the relevant variables. After that, a service (written in Python) combines the values for those variables with a tickscript and, using the Kapacitor API, deploys (updates, or deletes) the task on the Kapacitor server. This is then automated so that data for new customers/objects is combined with the templates and automatically deployed to Kapacitor.
You obviously need to design your tasks to be specific enough so that they don't overlap and generic enough so that it's not too much work to create tasks for every little thing.
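The template-plus-variables approach described above could look roughly like the following. This is a sketch, not the actual AMMP service: the template text, the function build_task, and all names and values are hypothetical. The dictionary follows the shape of a Kapacitor task definition (id, type, dbrps, script, status), which a real service would then send to the Kapacitor HTTP API; the HTTP call is left out here.

```python
from string import Template

# Hypothetical tickscript template; placeholders use string.Template's $name syntax.
CPU_TEMPLATE = Template("""stream
    |from()
        .measurement('cpu')
        .where(lambda: "host" == '$host')
    |alert()
        .crit(lambda: "usage_idle" < $crit_idle)
""")


def build_task(task_id, template, variables, db, rp):
    """Render a tickscript template and wrap it in a task definition."""
    return {
        "id": task_id,
        "type": "stream",
        "dbrps": [{"db": db, "rp": rp}],
        # substitute() raises KeyError if a variable is missing,
        # so incomplete tasks are never produced.
        "script": template.substitute(variables),
        "status": "enabled",
    }


task = build_task(
    "cpu_web01",
    CPU_TEMPLATE,
    {"host": "web01", "crit_idle": 10},
    db="telegraf",
    rp="autogen",
)
print(task["script"])
```

Keeping the rendering step separate from deployment makes it easy to check the generated scripts before anything is sent to the server.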

In InfluxDB/Telegraf How to compute difference between 2 fields based on 3rd field

I have the current use case:
We have a system that computes different response time metrics for messages that we want to insert in InfluxDB. This system writes JSON entries to a file.
We use telegraf with JSON plugin to extract the fields we want and insert into InfluxDB.
So far so good.
But we have an issue with 1 particular information.
The system will emit messages where mId is the unique identifier; in the examples below we have two, uuidXXXX and uuidYYYY:
{"meta1":"value", "mId":"uuidXXXX", "resTime1":1232332233, "timeWeEnterBus":startTimestamp}
{"meta1":"value2", "mId":"uuidYYYY", "resTime1":1232331111, "timeWeEnterBus":startTimestamp}
{"meta1":"value", "mId":"uuidXXXX", "resTime1":1232332233, "timeWeExitBus":endTimestamp}
{"meta1":"value2", "mId":"uuidYYYY", "resTime1":1232331111, "timeWeEnterBus":startTimestamp}
What we want here is to graph timeInBus, which is equal to timeWeExitBus - timeWeEnterBus for each unique mId.
So my questions are:
As I understand it, uuid would be a field, not a tag, as it is unbounded; the same goes for timeWeExitBus and timeWeEnterBus, which would be numeric fields since we want to use functions on them. And timeInBus would be the measurement. Am I right?
Is this a good use case for InfluxDB / Telegraf, or are we misusing them? As I understand it, computing this on the Telegraf side doesn't look like a good fit, but I don't see how to do it in InfluxDB. I initially thought the ELAPSED function could help, but I ended up concluding that it doesn't work here.
If it's a good use case, could you point me to documentation that helps with implementing this?
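For reference, the computation being asked for (pairing the enter and exit events of each mId) can be sketched in Python, independent of whether it ends up in Telegraf, InfluxDB, or a preprocessing step. This is only an illustration: the function name time_in_bus is made up, and the sample lines assume numeric timestamps in place of the startTimestamp/endTimestamp placeholders.

```python
import json


def time_in_bus(lines):
    """Pair enter/exit events by mId and return {mId: exit - enter}."""
    entered = {}    # mId -> enter timestamp, waiting for the matching exit
    durations = {}  # mId -> time spent in the bus
    for line in lines:
        event = json.loads(line)
        m_id = event["mId"]
        if "timeWeEnterBus" in event:
            entered[m_id] = event["timeWeEnterBus"]
        elif "timeWeExitBus" in event and m_id in entered:
            durations[m_id] = event["timeWeExitBus"] - entered.pop(m_id)
    return durations


lines = [
    '{"meta1": "value", "mId": "uuidXXXX", "timeWeEnterBus": 1000}',
    '{"meta1": "value2", "mId": "uuidYYYY", "timeWeEnterBus": 1005}',
    '{"meta1": "value", "mId": "uuidXXXX", "timeWeExitBus": 1042}',
]
print(time_in_bus(lines))
```

An mId with no exit event yet simply stays pending and produces no timeInBus value.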

Lowest value from 2 payloads in Node-Red

I have an IoT system at home with two temperature sensors.
One of the sensors can be in direct sun for some hours.
The real temperature is always the lowest value, so sometimes temp1, sometimes temp2.
What I want to achieve is:
read the temperature from sensor 1 (via MQTT)
read the temperature from sensor 2 (via MQTT)
compare the values
find the lowest one and send it via MQTT
go back to reading in a loop
For this example I can simulate readings with inject nodes.
How can I do that? I am new to Node-RED and have tried, but without success.
Here is my flow:
[{"id":"fa6372cc.47f92","type":"tab","label":"Flow 8","disabled":false,"info":""},{"id":"5ac90e03.22da3","type":"join","z":"fa6372cc.47f92","name":"","mode":"custom","build":"object","property":"payload","propertyType":"msg","key":"topic","joiner":"","joinerType":"str","accumulate":true,"timeout":"","count":"2","reduceRight":false,"reduceExp":"","reduceInit":"","reduceInitType":"","reduceFixup":"","x":990,"y":340,"wires":[["f09774bf.3c8428","a197b84d.6a7338"]]},{"id":"f09774bf.3c8428","type":"debug","z":"fa6372cc.47f92","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"true","x":1130,"y":340,"wires":[]},{"id":"43900e79.98cd8","type":"change","z":"fa6372cc.47f92","name":"set payload value","rules":[{"t":"set","p":"payload","pt":"msg","to":"req.params.value","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":790,"y":340,"wires":[["5ac90e03.22da3"]]},{"id":"b71d9143.c03bd","type":"change","z":"fa6372cc.47f92","name":"set topic temp1","rules":[{"t":"set","p":"topic","pt":"msg","to":"temp1","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":560,"y":320,"wires":[["43900e79.98cd8"]]},{"id":"e87114aa.6cd1","type":"change","z":"fa6372cc.47f92","name":"set topic temp2","rules":[{"t":"set","p":"topic","pt":"msg","to":"temp2","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":560,"y":360,"wires":[["43900e79.98cd8"]]},{"id":"783c47fd.8dd58","type":"inject","z":"fa6372cc.47f92","name":"temp source 2","topic":"","payload":"12","payloadType":"num","repeat":"3","crontab":"","once":false,"onceDelay":"1.5","x":380,"y":360,"wires":[["e87114aa.6cd1"]]},{"id":"271dedab.aaa7b2","type":"inject","z":"fa6372cc.47f92","name":"temp source 1","topic":"","payload":"10","payloadType":"num","repeat":"2","crontab":"","once":false,"onceDelay":"1","x":380,"y":320,"wires":[["b71d9143.c03bd"]]},{"id":"a197b84d.6a7338","type":"mqtt out","z":"fa6372cc.47f92","name":"temperature","topic":"domoticz/in","qos":"","retain":"","broker":"7e3561ec.acad","x":1150,"y":280,"wires":[]},{"id":"7e3561ec.acad","type":"mqtt-broker","z":"","name":"Domoticz","broker":"192.168.6.11","port":"8084","clientid":"","usetls":false,"compatmode":true,"keepalive":"60","cleansession":true,"birthTopic":"","birthQos":"0","birthRetain":"false","birthPayload":"","closeTopic":"","closeRetain":"false","closePayload":"","willTopic":"","willQos":"0","willRetain":"false","willPayload":""}]
One way to do it would be like this:
This stores the two temperatures in flow variables. The first flow initially sets them to a high number so that the "min" in "choose lower value" works later. In this case I've used a change node that sets the payload to the JSONata expression
$min([$flowContext("temp1"), $flowContext("temp2")])
but there are a few ways you could choose to do it.
Here is the code to try:
[{"id":"6bc2755e.9feb9c","type":"debug","z":"f454a93f.0e89d8","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"true","x":990,"y":340,"wires":[]},{"id":"38bd03eb.f7d06c","type":"change","z":"f454a93f.0e89d8","name":"choose lower value","rules":[{"t":"set","p":"payload","pt":"msg","to":"$min([$flowContext(\"temp1\"), $flowContext(\"temp2\")])\t","tot":"jsonata"}],"action":"","property":"","from":"","to":"","reg":false,"x":790,"y":340,"wires":[["6bc2755e.9feb9c"]]},{"id":"9066677f.eb0358","type":"change","z":"f454a93f.0e89d8","name":"store temp1","rules":[{"t":"set","p":"temp1","pt":"flow","to":"payload","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":550,"y":320,"wires":[["38bd03eb.f7d06c"]]},{"id":"a70c9b2a.e7db58","type":"change","z":"f454a93f.0e89d8","name":"store temp2","rules":[{"t":"set","p":"temp2","pt":"flow","to":"payload","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":550,"y":360,"wires":[["38bd03eb.f7d06c"]]},{"id":"4bd27616.d022c8","type":"inject","z":"f454a93f.0e89d8","name":"temp source 2","topic":"","payload":"12","payloadType":"num","repeat":"","crontab":"","once":false,"onceDelay":"1.5","x":370,"y":360,"wires":[["a70c9b2a.e7db58"]]},{"id":"7378dd4f.3825b4","type":"inject","z":"f454a93f.0e89d8","name":"temp source 1","topic":"","payload":"10","payloadType":"num","repeat":"","crontab":"","once":false,"onceDelay":"1","x":370,"y":320,"wires":[["9066677f.eb0358"]]},{"id":"314eb0ec.85211","type":"inject","z":"f454a93f.0e89d8","name":"","topic":"","payload":"","payloadType":"date","repeat":"","crontab":"","once":true,"onceDelay":0.1,"x":370,"y":260,"wires":[["688646b.138a6b8"]]},{"id":"688646b.138a6b8","type":"change","z":"f454a93f.0e89d8","name":"set to high","rules":[{"t":"set","p":"temp1","pt":"flow","to":"999","tot":"num"},{"t":"set","p":"temp2","pt":"flow","to":"999","tot":"num"}],"action":"","property":"","from":"","to":"","reg":false,"x":550,"y":260,"wires":[[]]}]
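The logic of the flow above (keep the latest reading per sensor, start from a high value, emit the minimum on every update) can also be written out in plain Python to make it easier to follow. This is a sketch of the idea only, not Node-RED code: the class name LowestTemperature is made up, and the initial value 999 mirrors the "set to high" node.

```python
class LowestTemperature:
    """Remember the latest reading per sensor and report the lowest one."""

    def __init__(self, sensors, initial=999.0):
        # Start high so that the first real reading always wins the min().
        self.latest = {name: initial for name in sensors}

    def update(self, sensor, value):
        """Store a new reading and return the current lowest temperature."""
        self.latest[sensor] = value
        return min(self.latest.values())


tracker = LowestTemperature(["temp1", "temp2"])
print(tracker.update("temp1", 10))
print(tracker.update("temp2", 12))
```

Each incoming MQTT message would call update with its sensor name, and the returned value is what gets published.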

Google Dataflow custom metrics not showing on Stackdriver

I'm trying to get a deeper view on my dataflow jobs by measuring parts of it using Metrics.counter & Metrics.gauge but I cannot find them on Stackdriver.
I have a premium Stackdriver account and I can see those counters under the Custom Counters section on the Dataflow UI.
However, I can see the droppedDueToLateness 'custom' counter on Stackdriver, and it seems to be created via Metrics.counter as well...
Another detail that might be helpful: when I navigate to https://app.google.stackdriver.com/services/dataflow, the message I get is this:
"You do not have any resources of this type being monitored by Stackdriver." That's weird as well, as if our Cloud Dataflow wasn't properly connected to Stackdriver. On the other hand, some metrics are displayed and can be monitored, such as System Lag, Watermark age, Elapsed time, Element count, etc.
What am I missing?
Regards
Custom metric naming conventions
When defining custom metrics in Dataflow, you have to adhere to the custom metric naming conventions, or they won't show up in Stackdriver.
Relevant snippet:
You must adhere to the following spelling rules for metric label names:
You can use upper and lower-case letters, digits, underscores (_) in the names.
You can start names with a letter or digit.
The maximum length of a metric label name is 100 characters.
If you create a metric with
Metrics.counter('namespace', 'name')
The metric shows up in Stackdriver as custom.googleapis.com/dataflow/name, so 'name' should adhere to the rules mentioned above. The namespace does not seem to be used by Stackdriver.
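To catch offending names before deploying a job, the rules quoted above can be turned into a small check. This is a sketch based only on those three rules: is_valid_metric_name is a made-up helper, not part of any Dataflow or Stackdriver library.

```python
import re

# First character: letter or digit. Remaining characters: letters, digits,
# or underscores. Total length: at most 100 characters.
METRIC_NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_]{0,99}")


def is_valid_metric_name(name):
    """Return True if name follows the quoted naming rules."""
    return METRIC_NAME_RE.fullmatch(name) is not None


for candidate in ("records_parsed", "records-parsed", "_hidden"):
    print(candidate, is_valid_metric_name(candidate))
```

A name containing a hyphen or a dot, for example, would fail this check and, per the answer above, would not show up in Stackdriver.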
Additional: labels
It doesn't seem possible to add labels to the metrics when defined this way. However, the full description of each time series of a metric is a string with the format
'name' job_name job_id transform
So you can aggregate by these 4 properties (+ region and project).
