What is the minimum scrape_interval in Prometheus? - monitoring

I am wondering what the minimum time is for Prometheus' scrape_interval parameter. According to the Prometheus documentation, the value for this parameter needs to follow a regex which, as far as I can tell, only allows intervals of one second or more, since e.g. "1ms" or "0.01s" do not appear to match it. In my application, however, I would like to scrape at millisecond intervals, so I am interested in whether this is possible with Prometheus.
Many thanks in advance!

According to the Prometheus documentation, the minimum value you can give for scrape_interval seems to be 0 (based on the regex given in the docs).
Regex - ((([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?|0)
According to this regex, you can specify scrape_interval in milliseconds as well, e.g. as 0s1ms or even plain 1ms. At first glance 1ms looks ambiguous, because the m component greedily matches the leading 1m; but since the whole string must match, the engine backtracks and matches 1ms via the ms component instead, so both spellings are accepted.
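This is easy to check directly. A minimal sketch in Python (the pattern below is the one from the docs; Go's regexp engine, which Prometheus uses, produces the same full-match results):

```python
import re

# Duration pattern from the Prometheus configuration docs.
DURATION = re.compile(
    r'((([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?'
    r'(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?|0)'
)

def is_valid_duration(s):
    # fullmatch forces the whole string to be consumed; after the greedy
    # "1m" attempt fails, the engine backtracks and matches "1ms" via
    # the ms component.
    return DURATION.fullmatch(s) is not None

print(is_valid_duration('1ms'))    # True
print(is_valid_duration('0s1ms'))  # True
print(is_valid_duration('0.01s'))  # False (no fractional seconds)
```

Note that the pattern also accepts the empty string (every component is optional), which Prometheus rules out separately.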

While Prometheus supports scrape intervals smaller than one second as described in this answer, it isn't recommended to use scrape_interval values smaller than one second because of the following issues:
Non-zero network delays between Prometheus and scrape target. These delays are usually in the range 0.1ms - 100ms depending on the distance between Prometheus and scrape target.
Non-zero delays in scrape target's response handler, which generates the response for Prometheus.
These non-deterministic delays may introduce large relative errors in scrape timings for scrape_interval values smaller than one second.
Scrape_interval values that are too small may also result in scrape errors if the target cannot be scraped within the configured interval. In this case Prometheus stores an up=0 sample for every unsuccessful scrape. See these docs about the up metric.
P.S. If you need to store high-frequency samples in time series, it is better to push these samples directly to a monitoring system that supports push protocols for data ingestion. For example, VictoriaMetrics supports popular push protocols such as Influx, Graphite, OpenTSDB, CSV, DataDog etc. - see these docs for details. It supports timestamps with millisecond precision. If you need even higher timestamp precision, take a look at InfluxDB - it supports timestamps with nanosecond precision. Note that very high timestamp precision usually leads to increased resource usage - disk space, RAM, CPU.
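As a rough sketch of what pushing millisecond-precision samples looks like, here is the Influx line protocol (the endpoint URL and measurement names are placeholders; both VictoriaMetrics and InfluxDB accept this format when precision=ms is set on the write endpoint):

```python
import time
import urllib.request

def influx_line(measurement, tags, fields, ts_ms):
    """Build one Influx line-protocol record with a millisecond timestamp."""
    tag_part = ','.join(f'{k}={v}' for k, v in sorted(tags.items()))
    field_part = ','.join(f'{k}={v}' for k, v in sorted(fields.items()))
    return f'{measurement},{tag_part} {field_part} {ts_ms}'

line = influx_line('cpu_usage', {'host': 'web1'}, {'value': 0.42},
                   int(time.time() * 1000))

# Hypothetical write endpoint; precision=ms tells the server the
# trailing timestamp is in milliseconds.
req = urllib.request.Request(
    'http://localhost:8428/write?precision=ms',  # placeholder URL
    data=line.encode(), method='POST')
# urllib.request.urlopen(req)  # uncomment against a real endpoint
```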
Disclosure: I work on VictoriaMetrics.

Related

How to get a precise value for a Docker container's network usage

For the project I'm working on, I need to be able to measure the total amount of network traffic for a specific container over a period of time. These periods are generally about 20 seconds, and the precision needed is realistically only in kilobytes. Ideally this solution would not involve additional software either in the containers or on the host machine, and would be suitable for Linux/Windows hosts.
Originally I had planned to use the 'NET I/O' attribute of the 'docker stats' command, but the field is automatically formatted into a more human-readable form (e.g. '200 MB'), which means that for containers that have been running for some time I can't get the precision I need.
Is there any way to get the raw value of 'NET I/O' or to reset the running count? Beyond this I've explored using something like tshark or iptables, but as I said above, ideally the solution would not require additional programs. If there isn't a good solution that fits those criteria, any other suggestions would be welcome. Thank you!
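One software-free option for Linux containers (a sketch, assuming the image ships `cat`, as most base images do): read the raw byte counters from /proc/net/dev inside the container's network namespace via docker exec, then diff two readings taken at the ends of the 20-second window:

```python
import subprocess

def parse_proc_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev output."""
    counters = {}
    for line in text.splitlines()[2:]:        # skip the two header lines
        iface, _, rest = line.partition(':')
        fields = rest.split()
        # column 0 is receive bytes, column 8 is transmit bytes
        counters[iface.strip()] = (int(fields[0]), int(fields[8]))
    return counters

def container_net_bytes(container):
    out = subprocess.run(
        ['docker', 'exec', container, 'cat', '/proc/net/dev'],
        capture_output=True, text=True, check=True).stdout
    return parse_proc_net_dev(out)

# Take a reading, wait ~20s, take another, and subtract the counters
# to get the bytes transferred in that window.
```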

Prometheus CPU Usage Histogram Metrics

My goal is to observe metrics (like CPU and memory usage) with Prometheus on a server and on its running Docker containers. Before sending an alarm, I would like to compare certain values of those metrics with e.g. a 0.95 quantile. However, after several weeks of searching the internet I am still struggling to create metrics for these quantiles. I am therefore asking in this thread for your help/advice on how a quantile for certain metrics can be created.
Background
The code base is a fork of the docprom repository. It relies on Prometheus for monitoring. Prometheus retrieves its data from a running cAdvisor container. The metrics cAdvisor provides for Prometheus can be seen on the following page. However, it provides only Gauge and Counter metric types, and during my research I was not able to find parameters that would allow modifying/extending those metrics.
Problem
According to my current understanding, the metric type should be a Histogram or Summary in order to observe the quantiles. What is the best approach to use the histogram_quantile query on the metrics provided by cAdvisor?
My current idea is to
create a custom server
fetch the desired data from Prometheus
calculate the desired data
provide it as a metric from the server, so that Prometheus can scrape it
Run histogram_quantile on the custom metric
Is it the right approach in order to create a metric that can be used with quantiles?
For example, I would like to fire an alarm if a certain container's CPU usage exceeds a 0.95 quantile. The code for the CPU usage can be seen below:
sum(rate(container_cpu_usage_seconds_total{name="CONTAINER_NAME"}[10m])) / count(node_cpu_seconds_total{mode="system"}) * 100
What would be the best approach to create the desired quantiles? Am I on the right path, or am I missing something simple here? It seems far too hard to get a simple query with a quantile.
I am thankful for all help and information.
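The fetch-and-compute steps proposed above can be sketched as follows (the Prometheus URL is a placeholder; /api/v1/query_range is the standard Prometheus HTTP API endpoint). Note that for gauge-style series it may also be worth trying PromQL's quantile_over_time() directly before building a custom server:

```python
import json
import time
import urllib.parse
import urllib.request

PROM_URL = 'http://localhost:9090'   # placeholder

def fetch_range(query, start, end, step='60s'):
    """Fetch raw samples for a PromQL expression via the HTTP API."""
    params = urllib.parse.urlencode(
        {'query': query, 'start': start, 'end': end, 'step': step})
    with urllib.request.urlopen(
            f'{PROM_URL}/api/v1/query_range?{params}') as resp:
        data = json.load(resp)
    # Assumes a single matching series; values are [timestamp, "value"].
    return [float(v) for _, v in data['data']['result'][0]['values']]

def p95(samples):
    """0.95 quantile with linear interpolation between ranks."""
    xs = sorted(samples)
    rank = 0.95 * (len(xs) - 1)
    lo, hi = int(rank), min(int(rank) + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (rank - lo)

# usage (needs a running Prometheus):
# now = time.time()
# cpu = fetch_range('sum(rate(container_cpu_usage_seconds_total'
#                   '{name="CONTAINER_NAME"}[10m]))', now - 3600, now)
# print(p95(cpu))
```

The computed value could then be exposed as a gauge from the custom server for Prometheus to scrape and alert on.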

InfluxDB disable downsampling

I like some of InfluxDB's functions, which is why I would like to use it instead of just MySQL etc.
In my case I need to pull from the DB exactly the same time series I pushed into it; any difference between what I put in and what I get out is considered data corruption.
Is it possible to disable downsampling in InfluxDB?
As per the documentation, Continuous Queries (CQ) and Retention Policies (RP) are features, but they appear to be mandatory rather than optional. Am I right, or is there a way of turning these things off?
Is there any other time series database that supports statistical functions and works with Grafana but does not have downsampling (or it is optional)?
Continuous Queries (CQ) and Retention Policies (RP) are optional; you don't need to use them. You can use the default retention policy named autogen, which has infinite retention, and keep data at its original granularity forever (that is, until you hit some resource limit - disk/memory/response times/...).
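A quick sketch of verifying this against InfluxDB 1.x's HTTP /query endpoint (the host and database name are placeholders):

```python
import urllib.parse
import urllib.request

INFLUX_URL = 'http://localhost:8086'   # placeholder

def query_url(q, db=None):
    """Build an InfluxDB 1.x /query URL for an InfluxQL statement."""
    params = {'q': q}
    if db:
        params['db'] = db
    return f'{INFLUX_URL}/query?{urllib.parse.urlencode(params)}'

# The default "autogen" policy shows duration 0s, meaning keep forever:
#   urllib.request.urlopen(query_url('SHOW RETENTION POLICIES ON "mydb"'))
# And no continuous queries exist unless you explicitly create them:
#   urllib.request.urlopen(query_url('SHOW CONTINUOUS QUERIES'))
```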

What is the recommended hardware for the following neo4j setup?

I need to build and analyze a complex network using neo4j and would like to know what is the recommended hardware for the following setup:
There are three types of nodes.
There are three types of relationships.
At steady state, the network will contain about 1M nodes of each type and about the same number of edges.
Every day, about 500K relationships are updated and 100K nodes and edges are added. Approximately the same number of nodes/edges are also removed.
Network updates will be done in daily batches, and we can tolerate update times of 1-2 hours.
Once the system is up, we will query the database for shortest paths between different nodes, not more than 500K times per day. We can live with batch queries.
Most probably I'll use the REST API.
I think you should take a look at Neo4j Hardware requirements.
For the server you're describing, I think the first thing needed will obviously be plenty of bandwidth; if your requests need to complete quickly, you'll need it.
Apart from that, a "normal" server should be enough :
8 or more cores
At least 24 GB of RAM
At least 1 TB of SSD storage (this one is important and expensive)
Good bandwidth (like 1 Gbps)
By the way, this isn't really a programming question, so you might have been better off asking Neo4j directly.
You can use Neo4j Hardware sizing calculator for rough estimation of the HW needs.
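Since the workload is shortest-path queries over REST, note that a single transactional Cypher request can batch many of them, which matters at 500K queries/day. A sketch (the endpoint path is the 3.x transactional API; the label-less match and id property are made-up placeholders):

```python
import json
import urllib.request

# Neo4j 3.x transactional endpoint; adjust for your version and auth.
URL = 'http://localhost:7474/db/data/transaction/commit'  # placeholder

def shortest_path_payload(pairs):
    """Build one request body holding a batch of shortestPath queries."""
    stmt = ('MATCH (a {id: $src}), (b {id: $dst}), '
            'p = shortestPath((a)-[*..10]-(b)) RETURN length(p)')
    return {'statements': [
        {'statement': stmt, 'parameters': {'src': s, 'dst': d}}
        for s, d in pairs]}

payload = shortest_path_payload([(1, 2), (3, 4)])
# req = urllib.request.Request(URL, data=json.dumps(payload).encode(),
#                              headers={'Content-Type': 'application/json'})
# urllib.request.urlopen(req)  # uncomment against a real server
```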

Naming statsd metrics for short lived streams

I am trying to model statistics to submit to statsd/graphite. However what I am monitoring is "session" centric. For example, I have a game that is played in real time. There are multiple instances of a game active on the servers. Each game has multiple (and variable number of) participants. Each instance of a game has a unique ID as does each player.
I want to track (and graph) each player's stats but then roll the metric up for the whole instance, and then for all instances of a game. For example, there may be two instances of a game active at a given time. Let's say each has two players in the game:
GameTitle.RealTime.VoiceErrors.game_instance_a.player_id_1 10
GameTitle.RealTime.VoiceErrors.game_instance_a.player_id_2 20
GameTitle.RealTime.VoiceErrors.game_instance_b.player_id_3 50
GameTitle.RealTime.VoiceErrors.game_instance_b.player_id_4 70
where game_instances and player_ids are 128 bit numbers
And I want to be able to see that the total of all voice errors for game_instance_a is 30,
while all voice errors across the system total 150.
Given this, I have three questions:
What guidance would you have on naming the metrics?
Is it kosher to have metrics that have "dynamic" identifiers as part of the name?
What are the scale limits on this? If I had 100K game instances with, say, as many as 1000 players in a game, is this going to kill statsd/graphite?
Thanks!
What guidance would you give on naming the metrics?
Graphite recommends that "Volatile path components should be kept as deep into the hierarchy as possible". This essentially means that if you can push the parts of the metrics that are frequently unique to the end of the "bucket" without impacting your grouping queries you should try to do so.
Here is a great post on using Graphite that includes naming recommendations. And here is another one with additional info from Jason Dixon (an excellent source for Graphite stuff in general).
Is it kosher to have metrics that have "dynamic" identifiers as part of the name?
I usually try to avoid identifiers in the metric names unless they are very low in number (<100). Because Graphite will store a .wsp file for every metric name you'll have a difficult time re-sizing or adjusting the storage settings should you decide to change your configuration. Additionally, the Graphite UI will have a "folder" for every metric name so you can easily make the UI unusable.
In your case, I'd probably graph the total number of game instances, the total number of players, and the number of errors (by type), etc. Additionally, I might try to track players per instance (generally) and maybe errors per instance (again without knowing the actual instance. e.g. GameTitle.RealTime.PerInstance.VoiceErrors) if I had that capability (i.e. state stored per instance in my application).
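To make the naming concrete, here is a sketch of emitting both a global rollup and a per-instance counter over statsd's plain-text UDP line protocol, keeping the volatile identifier last and deepest (the host/port and bucket names are illustrative):

```python
import socket

STATSD_ADDR = ('127.0.0.1', 8125)   # placeholder
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def counter_packet(bucket, value):
    """statsd counter in its plain-text line protocol: <bucket>:<value>|c"""
    return f'{bucket}:{value}|c'

def record_voice_error(instance_id, count=1):
    # Stable rollup bucket first; the volatile per-instance component
    # sits at the very end of the hierarchy.
    for bucket in (
            'GameTitle.RealTime.VoiceErrors.total',
            f'GameTitle.RealTime.PerInstance.VoiceErrors.{instance_id}'):
        sock.sendto(counter_packet(bucket, count).encode(), STATSD_ADDR)

record_voice_error('game_instance_a', 10)  # UDP is fire-and-forget
```

With this layout, summing all per-instance series (or just reading the total bucket) gives the system-wide count, while each instance remains individually graphable.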
Logstash, Elasticsearch, Kibana
I'd suggest logging this error information with instance and player ids and using Logstash to ship your logs to Elasticsearch and Kibana. Then I'd watch Graphite for real-time error and health anomaly detection and use Kibana (with Elasticsearch underneath) to dig deeper.
What are the scale limits on this? If I had 100K game instances with, say, as many as 1000 players in a game, is this going to kill statsd/graphite?
Statsd should have no problem with this, as it acts as a (mostly) dumb aggregator. While it does maintain some state internally, I don't anticipate a problem.
I don't think you'll have problems with the internal Graphite Whisper Storage itself, as it is just using files and folders. But, as I mentioned above, the Graphite Web UI will be unusable and I think you'll also run the risk of other manageability issues.
Summary
Keep the volatile (dynamic) metric buckets at the end of the name and avoid going above a couple hundred of these.
