Calculating an average using a Histogram metric

I am generating a Histogram using a Prometheus client. The metric name is retrieve_stripe_subscription_latency_ms. Since a Histogram generates additional series with the suffixes _sum and _count, can I calculate the average using the query below in Grafana?
sum(retrieve_stripe_subscription_latency_ms_sum)/sum(retrieve_stripe_subscription_latency_ms_count)

I think the official Prometheus documentation addresses this. Adapted to your metric name:
To calculate the average [...put here a metric meaning...] during the last 5 minutes from a histogram or summary called retrieve_stripe_subscription_latency_ms, use the following expression:
rate(retrieve_stripe_subscription_latency_ms_sum[5m])
/
rate(retrieve_stripe_subscription_latency_ms_count[5m])
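As a sanity check of the arithmetic (this is a hedged illustration with made-up counter values, not from the Prometheus docs): because both _sum and _count are cumulative counters, the rate-based expression reduces to the change in _sum divided by the change in _count over the window. A minimal Python sketch:
# Hypothetical cumulative counter values scraped 5 minutes apart (made-up numbers)
sum_t0, count_t0 = 120000.0, 400    # total latency in ms, number of observations
sum_t1, count_t1 = 150000.0, 500
# rate(_sum[5m]) / rate(_count[5m]) cancels the time factor, leaving delta_sum / delta_count
avg_latency_ms = (sum_t1 - sum_t0) / (count_t1 - count_t0)
print(avg_latency_ms)  # 300.0 ms average over that 5 minute window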

Related

Clustering suggestions: I have an unlabelled dataset with 6 attributes (all numeric) and ~100k data points. I want to cluster similar data points.

As part of preprocessing:
I removed attributes that are highly correlated (> 0.8).
I standardized the data (StandardScaler).
# To reduce the data to lower dimensions I used UMAP
import hdbscan
from umap import UMAP

reducer = UMAP(n_neighbors=20,
               min_dist=0,
               spread=2,
               n_components=3,
               metric='euclidean')
df_umap = reducer.fit_transform(df_scaled1)

# For clustering I used HDBSCAN
clusterer = hdbscan.HDBSCAN(min_cluster_size=30, max_cluster_size=100, prediction_data=True)
clusterer.fit(df_umap)

# Assign cluster labels to the original dataset
df['cluster'] = clusterer.labels_
Data shape: (130351, 6). Sample rows:

Column a | Column b | Column c | Column d | Column e | Column f
6.000194 | 7.0 | 1059216 | 353069.000000 | 26.863543 | 15.891751
3.001162 | 3.5 | 1303727 | 396995.666667 | 32.508957 | 11.215764
6.000019 | 7.0 | 25887 | 3379.000000 | 18.004558 | 10.993119
6.000208 | 7.0 | 201138 | 59076.666667 | 41.140104 | 10.972880
6.000079 | 7.0 | 59600 | 4509.666667 | 37.469000 | 9.667119
df.describe(): (output not shown)
Results:
1. While some of the clusters have very similar data points (example: cluster 1555), a lot of them have extreme data points associated with a single cluster (example: cluster 5423).
2. Cluster id '-1' (HDBSCAN's noise label) has 36221 data points associated with it.
My questions:
Am I using the correct approach for the data I have and the result I am trying to achieve?
Is UMAP the correct choice for dimensionality reduction?
Is HDBSCAN the right choice for this clustering problem? (I chose HDBSCAN because it doesn't need the number of clusters as user input, and the minimum and maximum number of data points per cluster can be set beforehand.)
How do I tune the clustering model to achieve better cluster quality? (I am assuming that with better cluster quality, the points currently assigned to cluster '-1' will also get clustered.)
Is there any method to assess cluster quality? (A possible starting point is sketched below.)
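One possible starting point for the last question, sketched here as an assumption rather than a definitive answer: compute the silhouette score (scikit-learn) on the non-noise points and track the noise fraction while tuning min_cluster_size and the UMAP parameters. The sketch assumes the df_umap and clusterer objects from the code above:
import numpy as np
from sklearn.metrics import silhouette_score

# Sketch only: assumes df_umap (reduced data) and clusterer (fitted HDBSCAN) from above
labels = clusterer.labels_
mask = labels != -1  # the silhouette score is only meaningful for points assigned to a cluster

noise_fraction = np.mean(labels == -1)
print(f"noise fraction: {noise_fraction:.2%}")

if mask.sum() > 0 and len(set(labels[mask])) > 1:
    score = silhouette_score(df_umap[mask], labels[mask])
    print(f"silhouette score (noise excluded): {score:.3f}")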

Using a Grafana Histogram with Prometheus Buckets

I have a Prometheus metric called latency, with a bunch of buckets.
I'm using an increase query to determine all the events that happened in the last 15 minutes (in all the buckets).
This query works well; switching to table view shows numbers that make sense: most latencies are below 300 ms, with some above that value.
However, when I use a Grafana Histogram panel, it seems like the x and y axes are interchanged.
Now I could use a different panel type, like a Bar Gauge. I tried that, but it doesn't work well: I have too many buckets, so the labels become totally illegible. Also, it forces me to display all the buckets that my application collects; it would be nice if that weren't set in stone and I could aggregate buckets in Grafana. It also doesn't work well once I change the bucket sizes to exponential sizes.
Any hint on how to either get the Histogram panel working properly (x-axis: bucket (in s), y-axis: count), or another visualization that would be appropriate here? My preferred outcome would be something like the plot of a function.
Answer
The Grafana panel type "Bar gauge" with format option "Heatmap" and interval $__range seems to be the best option if you have a small number of buckets. There is no proper solution for a large number of buckets yet.
The documentation states that the format option "Heatmap" should work with the panel type "Heatmap" (and it does); see Introduction to histograms and heatmaps, section Pre-bucketed data. The Heatmap panel has an option to show a histogram on mouseover, so you might want to use this.
About panel type Histogram
The Grafana panel type "Histogram" produces a value distribution, where the value of each bucket is a count. This panel type does not work well with Prometheus histograms, even if you switch from format option "Time series" to "Heatmap". I don't know whether this is due to the beta status of this panel type in the Grafana version I am currently using (9.2.4). There are also open bugs claiming that the maximum value of the x-axis is not computed correctly; see issue 32006 and issue 33073.
The larger the number of buckets, the better the estimation of histogram_quantile(). You could let the Histogram panel calculate a distribution of latencies by using this function. Let's start with the following query:
histogram_quantile(1, sum by (le) (rate(latency_bucket{...}[$__rate_interval])))
You could now visualize the query results with the Histogram panel and set the bucket size to a very small number such as 0.1. The resulting histogram ignores a significant amount of samples, as it only relates to the maximum value of all data points within $__rate_interval.
The values on the y-axis depend on the interval: the smaller the interval, the higher the values, simply because there are more data points in the query result. This is a big downside; you lose the exact number of data points that you originally had in the buckets.
I cannot really recommend this, but it might be worth a try.
Additional notes
Grafana has transform functions like "Create heatmap" and "Histogram", but these are not useful for Prometheus histogram data. Note that "Create heatmap" allows setting one dimension to logarithmic.
There are two interesting design documents that show that the developers of Prometheus are aware of the problems with the current implementation of histograms and are working on some promising features:
Sparse high-resolution histograms for Prometheus
Prometheus Sparse Histograms and PromQL
See DESIGN DOCUMENTS.
There also is this feature request Prometheus histogram as stacked chart over time #11464.
There is an excellent overview of histograms: How to visualize Prometheus histograms in Grafana.

Prometheus latency graph in histogram and calculate percentile

I need to plot a latency graph in Prometheus from a histogram time series, but I have been unable to display a histogram in Grafana.
What I expect is to be able to show:
Y-axis: latency, x-axis: time.
Each line representing p50, p75, p90 and p100, aggregated over a given time window.
A sample metric would be the request time of an nginx server.
Suppose I have a histogram like this:
nginx_request_time_bucket{le="1"} 1
nginx_request_time_bucket{le="10"} 2
nginx_request_time_bucket{le="60"} 2
nginx_request_time_bucket{le="+Inf"} 5
An example of the graph I am looking for is at this link: https://www.instana.com/blog/how-to-measure-latency-properly-in-7-minutes/
I tried to picture the histogram with a heatmap using the query below, but it doesn't give me what I'm looking for. I'm looking for something similar to the graph in the link.
histogram_quantile(0.75, sum(rate(nginx_request_time_bucket[5m])) by (le))
Any help here is highly appreciated!
You need to set up a separate query on a Grafana graph for each needed percentile. For example, if you need p50, p75, p90 and p100 latencies over the last 5 minutes, then the following four separate queries should be set up in Grafana:
histogram_quantile(0.50, sum(rate(nginx_request_time_bucket[5m])) by (le))
histogram_quantile(0.75, sum(rate(nginx_request_time_bucket[5m])) by (le))
histogram_quantile(0.90, sum(rate(nginx_request_time_bucket[5m])) by (le))
histogram_quantile(1.00, sum(rate(nginx_request_time_bucket[5m])) by (le))
P.S. It is possible to compact these queries into a single one by using the histogram_quantiles function from a Prometheus-compatible query engine such as MetricsQL:
histogram_quantiles(
"percentile",
0.50, 0.75, 0.90, 1.00,
sum(rate(nginx_request_time_bucket[5m])) by (le)
)
Note that the accuracy of percentiles calculated from Prometheus histograms depends heavily on the chosen buckets. It may be hard to choose the best set of buckets for a given set of percentiles, so it may be better to use histograms with automatically generated buckets. See, for example, VictoriaMetrics histograms.
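To illustrate why the bucket layout limits accuracy, here is a rough Python sketch of the linear interpolation that histogram_quantile() performs, using the four buckets from the question (this is a simplification of the real implementation and ignores its edge cases):
# Cumulative bucket counts from the question: le=1 -> 1, le=10 -> 2, le=60 -> 2, le=+Inf -> 5
bounds = [1.0, 10.0, 60.0, float("inf")]
counts = [1, 2, 2, 5]

def quantile_estimate(q, bounds, counts):
    # Rank of the requested quantile among all observed samples
    rank = q * counts[-1]
    lower, prev_count = 0.0, 0
    for upper, count in zip(bounds, counts):
        if rank <= count:
            if upper == float("inf"):
                # Prometheus returns the upper bound of the last finite bucket here
                return lower
            # Linear interpolation inside the bucket containing the rank
            return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
        lower, prev_count = upper, count

print(quantile_estimate(0.75, bounds, counts))  # 60.0 -- p75 falls into the +Inf bucket
print(quantile_estimate(0.20, bounds, counts))  # 1.0  -- interpolated within the first bucket
The estimate can never be more precise than the bucket boundaries, which is why the choice of buckets matters so much.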

Correlation between time series

I have a dataset where a process is described as a time series made of ~2000 points and 1500 dimensions.
I would like to quantify how much each dimension is correlated with another time series measured by another method.
What is the appropriate way to do this (ideally in Python)? I have heard that Pearson correlation is not well suited for this task, at least without data preparation. What are your thoughts on that?
Many thanks!
A good general rule in data science is to try the easy thing first. Only when the easy thing fails should you move on to something more complicated. With that in mind, here is how you would compute the Pearson correlation between each dimension and the other time series. The key function here is pearsonr:
import numpy as np
from scipy.stats import pearsonr
# Generate a random dataset using 2000 points and 1500 dimensions
n_times = 2000
n_dimensions = 1500
data = np.random.rand(n_times, n_dimensions)
# Generate another time series, also using 2000 points
other_time_series = np.random.rand(n_times)
# Compute correlation between each dimension and the other time series
correlations = np.zeros(n_dimensions)
for dimension in range(n_dimensions):
# The Pearson correlation function gives us both the correlation
# coefficient (r) and a p-value (p). Here, we only use the coefficient.
r, p = pearsonr(data[:, dimension], other_time_series)
correlations[dimension] = r
# Now we have, for each dimension, the Pearson correlation with the other time
# series!
len(correlations)
# Print the first 5 correlation coefficients
print(correlations[:5])
If Pearson correlation doesn't work well for you, you can try swapping out the pearsonr function for something else, like:
spearmanr: Spearman rank-order correlation coefficient.
kendalltau: Kendall's tau, a correlation measure for ordinal data.
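If looping over 1500 dimensions feels slow, the same Pearson coefficients can also be computed in one vectorized step with NumPy. This is a sketch that reuses the data and other_time_series arrays defined above:
# Center the data matrix and the other series
data_centered = data - data.mean(axis=0)
other_centered = other_time_series - other_time_series.mean()
# Pearson r per dimension: covariance divided by the product of the norms of the centered vectors
numerator = data_centered.T @ other_centered
denominator = np.linalg.norm(data_centered, axis=0) * np.linalg.norm(other_centered)
correlations_vectorized = numerator / denominator
# Should match the loop-based result up to floating point error
print(np.allclose(correlations, correlations_vectorized))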

Time series distance metric

In order to cluster a set of time series I'm looking for a smart distance metric.
I've tried some well-known metrics, but none of them fits my case.
Example: let's assume that my clustering algorithm extracts these three centroids [s1, s2, s3].
I want to put a new example [sx] into the most similar cluster.
The most similar centroid is the second one, so I need a distance function d that gives me d(sx, s2) < d(sx, s1) and d(sx, s2) < d(sx, s3).
Edit: here are the results with the metrics [cosine, euclidean, minkowski, dynamic time warping] (plots not shown).
Edit 2: user Pietro P suggested applying the distances to the cumulated version of the time series. The solution works; the plots and metrics are not shown here.
Nice question! Using any standard distance on R^n (Euclidean, Manhattan or, more generally, Minkowski) over those time series cannot achieve the result you want, since those metrics are independent of permutations of the coordinates of R^n, while time is strictly ordered and that ordering is exactly the phenomenon you want to capture.
A simple trick that does what you ask is to use the cumulated version of the time series (sum the values over time as time increases) and then apply a standard metric. Using the Manhattan metric, the distance between two time series is the area between their cumulated versions.
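As a toy sketch of this cumulated-series trick (the arrays below are made-up stand-ins for the s1, s2, s3 and sx from the question):
import numpy as np

def cumulative_manhattan(a, b):
    # Manhattan distance between the cumulated versions of two equal-length series,
    # i.e. the area between their cumulative curves
    return np.abs(np.cumsum(a) - np.cumsum(b)).sum()

# Made-up example series standing in for the centroids and the new example
s1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
s2 = np.array([4.0, 3.0, 2.0, 1.0, 0.0])
s3 = np.array([2.0, 2.0, 2.0, 2.0, 2.0])
sx = np.array([3.9, 3.1, 2.0, 0.9, 0.1])

centroids = [s1, s2, s3]
distances = [cumulative_manhattan(sx, c) for c in centroids]
print(distances.index(min(distances)))  # 1 -> sx is assigned to the second centroid, s2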
Another approach would be to use DTW (dynamic time warping), an algorithm to compute the similarity between two temporal sequences. Full disclosure: I wrote a Python package for this purpose called trendypy, which you can install via pip (pip install trendypy). Here is a demo of how to use the package. You're basically computing the total minimum distance over different combinations to set the cluster centers.
What about using the standard Pearson correlation coefficient? Then you can assign the new point to the cluster with the highest coefficient.
correlation = scipy.stats.pearsonr(<new time series>, <centroid>)
Pietro P's answer is just a special case of applying a convolution to your time series.
If I used the kernel
[1, 1, ..., 1, 1, 1, 0, 0, 0, 0, ..., 0, 0]
I would get a cumulative series.
Applying a convolution works because you're giving each data point information about its neighbours, so the result is order dependent.
It might be interesting to try a Gaussian convolution or other kernels.
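A small demonstration of that point (a sketch with made-up data): convolving with a causal all-ones kernel reproduces the cumulative sum, while a Gaussian kernel blends each point with its neighbours instead.
import numpy as np
from scipy.ndimage import gaussian_filter1d

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

# An all-ones kernel as long as the series turns convolution into a running (cumulative) sum
kernel = np.ones(len(x))
cumulative = np.convolve(x, kernel)[:len(x)]
print(np.allclose(cumulative, np.cumsum(x)))  # True

# A Gaussian kernel instead gives each point a weighted view of its neighbours
smoothed = gaussian_filter1d(x, sigma=1.0)
print(smoothed)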

Resources