Using a Grafana Histogram with Prometheus Buckets

I have a Prometheus metric called latency, with a bunch of buckets.
I'm using an increase query to determine all the events that happened in the last 15 minutes (in all the buckets).
This query works well; switching to table view shows numbers that make sense: most latencies are below 300 ms, with some above that value.
However, when I use a Grafana Histogram, the x and y axes seem to be interchanged.
I could use a different diagram style, like a Bar Gauge. I tried that, but it doesn't work well: I have too many buckets, so the labels become totally illegible. It also forces me to display every bucket that my application collects; it would be nice if that weren't set in stone and I could aggregate buckets in Grafana. It also doesn't work well once I switch to exponential bucket sizes.
Any hints on how to either get the Histogram working properly (x axis: bucket in seconds, y axis: count), or on another visualization that would be appropriate here? My preferred outcome would be something like the plot of a function.

Answer
The Grafana panel type "Bar gauge" with format option "Heatmap" and interval $__range seems to be the best option if you have a small number of buckets. There is no proper solution for large number of buckets, yet.
The documentation states that the format option "Heatmap" should work with panel type "Heatmap" (and it does), see Introduction to histograms and heatmaps with Pre-bucketed data. The Heatmap panel has an option to produce a histogram on mouseover, so you might want to use this.
About panel type Histogram
The Grafana panel type "Histogram" produces a value distribution and the value of some bucket is a count. This panel type does not work well with Prometheus histograms, even if you switch from format option "Time series" to "Heatmap". I don't know if this is due to the beta status of this panel type in the Grafana Version I am currently using (which is 9.2.4). There are also open bugs, claiming that the maximum value of the x axis is not computed correctly, see issue 32006 and issue 33073.
The larger the number of buckets, the better the estimation of histogram_quantile(). You could let the Histogram panel calculate a distribution of latencies by using this function. Let's start with the following query:
histogram_quantile(1, sum by (le) (rate(latency_bucket{...}[$__rate_interval])))
You could now visualize the query results with the Histogram panel and set the bucket size to a very small number such as 0.1. The resulting histogram ignores a significant number of samples, since it only reflects the maximum value of all data points within $__rate_interval.
The values on the y axis depend on the interval: the smaller the interval, the higher the values, simply because there are more data points in the query result. This is a big downside; you lose the exact number of data points that you originally had in the buckets.
I cannot really recommend this, but it might be worth a try.
Additional notes
Grafana has transform functions like "Create heatmap" and "Histogram", but these are not useful for Prometheus histogram data. Note that "Create heatmap" allows you to set one dimension to a logarithmic scale.
There are two interesting design documents that show that the Prometheus developers are aware of the problems with the current histogram implementation and are working on some promising features:
Sparse high-resolution histograms for Prometheus
Prometheus Sparse Histograms and PromQL
See DESIGN DOCUMENTS.
There is also the feature request Prometheus histogram as stacked chart over time #11464.
There is an excellent overview of histograms: How to visualize Prometheus histograms in Grafana.

Related

Prometheus latency graph in histogram and calculate percentile

I need to plot a latency graph in Prometheus from the histogram time series, but I've been unable to display a histogram in Grafana.
What I expect is to be able to show:
Y-axis is latency, x-axis is time.
Each line represents p50, p75, p90 and p100, aggregated over a given time window.
A sample metric would be the request time of an nginx server.
Suppose I have a histogram like this:
nginx_request_time_bucket{le="1"} 1
nginx_request_time_bucket{le="10"} 2
nginx_request_time_bucket{le="60"} 2
nginx_request_time_bucket{le="+Inf"} 5
An example graph of what I am looking for is in this link: https://www.instana.com/blog/how-to-measure-latency-properly-in-7-minutes/
I tried to visualize the histogram as a heatmap using the query below, but it doesn't give me what I'm looking for; I'm looking for something similar to the graph in the link above:
histogram_quantile(0.75, sum(rate(nginx_request_time_bucket[5m])) by (le))
Any help here is highly appreciated!
You need to set up a separate query on a Grafana graph for each needed percentile. For example, if you need p50, p75, p90 and p100 latencies over the last 5 minutes, then the following four separate queries should be set up in Grafana:
histogram_quantile(0.50, sum(rate(nginx_request_time_bucket[5m])) by (le))
histogram_quantile(0.75, sum(rate(nginx_request_time_bucket[5m])) by (le))
histogram_quantile(0.90, sum(rate(nginx_request_time_bucket[5m])) by (le))
histogram_quantile(1.00, sum(rate(nginx_request_time_bucket[5m])) by (le))
P.S. It is possible to compact these queries into a single one by using the histogram_quantiles function from a Prometheus-compatible query engine such as MetricsQL:
histogram_quantiles(
  "percentile",
  0.50, 0.75, 0.90, 1.00,
  sum(rate(nginx_request_time_bucket[5m])) by (le)
)
Note that the accuracy of percentiles calculated over Prometheus histograms depends heavily on the chosen buckets. It may be hard to choose the best set of buckets for a given set of percentiles, so it may be better to use histograms with automatically generated buckets. See, for example, VictoriaMetrics histograms.

How to Address Noise Resulting from Inverse-Scaling for a Machine Learning Task

I'm still a little unsure whether questions like these belong on Stack Overflow. Is this website only for questions with explicit code? The "How to Format" just tells me it should be a programming question, which it is. I will remove my question if the community thinks otherwise.
I have created a neural network and am predicting reasonable values for most of my data (the task is multi-variate time series forecasting).
I scale my data before inputting it using scikit-learn's MinMaxScaler(feature_range=(0, 1)) or MinMaxScaler(feature_range=(-1, 1)) (the two primary scalings I am using).
The model learns and predicts, and I invert the scaling using MinMaxScaler's inverse_transform method to visually see how close my predictions are to the actual values. However, I notice that the inverse-scaled values for a particular part of the predicted vector have become very noisy. Here is what I mean (inverse-scaled prediction):
Left end: noisy; right end: not so noisy.
I initially thought that perhaps my network didn't learn that part of the vector well, and so is just outputting ~random values. But I notice that the predicted values before the inverse scaling seem to match the actual values very well, and that these values are typically near 0 or -1 (the lower limit of the feature scale) because these values have a very large spread (unscaled mean = 1E-1, max = 1E+1 [not an outlier]). Example (scaled prediction):
So, when inverse transforming these values (again, often near -1 or 0), the transformed values exhibit strong noise, as shown in the images.
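For concreteness, here is a minimal sketch of the round-trip described above, with made-up data and a fake "prediction" standing in for the network's output (the real model and data are not shown in the question). It only illustrates how a small error in the scaled space gets stretched back up by the column's range on inverse_transform:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# One column with a very large spread, like the feature described above.
X = rng.lognormal(mean=-2.0, sigma=2.0, size=(1000, 1))

scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X)

# Pretend the network predicts the scaled values with a small, uniform error.
pred_scaled = X_scaled + rng.normal(scale=0.01, size=X_scaled.shape)

# Back in the original units that same error is multiplied by (max - min) / 2,
# so values that sit near the lower limit of the feature range look very noisy.
pred = scaler.inverse_transform(pred_scaled)
print("mean abs error, scaled space:   ", np.abs(pred_scaled - X_scaled).mean())
print("mean abs error, original units: ", np.abs(pred - X).mean())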
Questions:
1.) Should I be using a different scaler, or be scaling differently, perhaps with something that scales exponentially/nonlinearly? MinMaxScaler() scales each column independently. Simply dropping the high-magnitude data isn't an option, since they are real, meaningful values.
2.) What other solutions could help with this?
Please let me know if you'd like anything else clarified.

Get a blob's Skewness by inspecting the 3rd Order Moment?

I've been reading up on Image Moments and it looks like they are very useful for efficiently describing a blob. Apparently, the 3rd order moment represents/can tell me about a blob's skewness (is this correct?).
How can I get the 3rd order moment in OpenCV? Do you have a calculation/formula you can point me to?
Moments m = moments(contour, false);
// Are any of these the 3rd order moment?
m.m03;
m.mu03;
m.nu03;
As stated in the OpenCV docs for moments():
Calculates all of the moments up to the third order of a polygon or rasterized shape.
So yes, moments() does return what you're after.
The three quantities you mention, m03, mu03, nu03, are all different types of the third-order moment.
m03 is the raw third-order moment.
mu03 is the third-order central moment, i.e. the same as m03 but computed as if the blob were centered at (0, 0), that is, about its centroid.
nu03 is mean-shifted and normalized, i.e. the same as mu03 but divided by a power of the blob's area, which makes it scale-invariant.
Let's say you wanted to describe a shape but be agnostic to the size or the location of it in your image. Then you would use the mean-shifted and normalized descriptors, nu03. But if you wanted to keep the size as part of the description, then you'd use mu03. If you wanted to keep the location information as well, you'd use m03.
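To make that concrete, here is a small sketch in Python/OpenCV that pulls the three third-order quantities out of cv2.moments() and turns mu03 into a conventional skewness value for the y-coordinate distribution. The contour is a made-up triangle, and the skewness formula is the standard statistical one applied to the blob, not something taken from the OpenCV docs:
import cv2
import numpy as np

# Hypothetical contour, just to have something to measure.
contour = np.array([[10, 10], [200, 20], [60, 180]], dtype=np.int32)
m = cv2.moments(contour)

print(m["m03"], m["mu03"], m["nu03"])  # raw, central, normalized central

# Skewness of the y-coordinate distribution: third central moment divided by
# the standard deviation cubed (both expressed per unit of "mass" m00).
skew_y = (m["mu03"] / m["m00"]) / (m["mu02"] / m["m00"]) ** 1.5
print("skewness along y:", skew_y)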
You can think about it in the same way you think about distributions in general. Saying that I have a sample of x = 500 from a normal distribution with mean 450 and standard deviation 25 is basically the same as saying I have a sample of x = 2 from a normal distribution with mean 0 and standard deviation 1. Sometimes you might want to talk about the distribution in terms of its actual parameters (mean 450, std dev 25), sometimes as if it were mean-centered (mean 0, std dev 25), and sometimes as if it were the standard Gaussian (mean 0, std dev 1).
This excellent answer goes over how to manually calculate the moments, which in tandem with the 'basic' descriptions I gave above, should make the formulas in the OpenCV docs for moments() a little more comfortable.

Non-linear interaction terms in Stata

I have a continuous dependent variable polity_diff and a continuous primary independent variable nb_eq. I have hypothesized that the effect of nb_eq will vary with different levels of the continuous variable gini_round in a non-linear manner: The effect of nb_eq will be greatest for mid-range values of gini_round and close to 0 for both low and high levels of gini_round (functional shape as a second-order polynomial).
My question is: how is this modelled in Stata?
So far I've tried a categorized version of gini_round, which allows me to compare the different groups, but obviously this doesn't use the data to its fullest. I can't get my head around the inclusion of a single interaction term that would allow me to test my hypothesis. My best bet so far is something along the lines of the following (simplified by excluding some if-arguments etc.):
xtreg polity_diff c.nb_eq##c.gini_round_squared, fe vce(cluster countryno),
but I have close to 0 confidence that this is even nearly right.
Here's how I might do it:
sysuse auto, clear
reg price c.weight#(c.mpg##c.mpg) i.foreign
margins, dydx(weight) at(mpg = (10(10)40))
marginsplot
margins, dydx(weight) at(mpg=(10(10)40)) contrast(atcontrast(ar(2(1)4)._at) wald)
We interact weight with a second-degree polynomial of mpg. The first margins command calculates the average marginal effect of weight at different values of mpg; the graph looks like what you describe. The second margins command compares the slopes at adjacent values of mpg and does a joint test that they are all equal.
I would probably give weight its own effect as well (two octothorpes rather than one), but the graph does not come out like your example:
reg price c.weight##(c.mpg##c.mpg) i.foreign

Kohonen Self Organizing Maps: Determining the number of neurons and grid size

I have a large dataset I am trying to do cluster analysis on using SOM. The dataset is HUGE (~ billions of records) and I am not sure what should be the number of neurons and the SOM grid size to start with. Any pointers to some material that talks about estimating the number of neurons and grid size would be greatly appreciated.
Thanks!
Quoting from the som_make function documentation of the SOM Toolbox:
It uses a heuristic formula of 'munits = 5*dlen^0.54321'. The 'mapsize' argument influences the final number of map units: a 'big' map has x4 the default number of map units and a 'small' map has x0.25 the default number of map units.
Here, dlen is the number of records in your dataset.
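As a quick illustration of that heuristic, here is a sketch in Python. The exponent and the ceiling come from the formula quoted above; splitting the units into a roughly square grid is my own simplification (the toolbox derives the side ratio from the data's eigenvalues instead, as in the R port further down):
import math

def som_grid_size(dlen: int) -> tuple[int, int, int]:
    # dlen: number of records in the dataset.
    munits = math.ceil(5 * dlen ** 0.54321)   # heuristic number of map units
    side = round(math.sqrt(munits))           # roughly square layout
    return munits, side, math.ceil(munits / side)

print(som_grid_size(10_000))   # roughly 745 units on a 27 x 28 grid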
You can also read about the classic WEBSOM which addresses the issue of large datasets
http://www.cs.indiana.edu/~bmarkine/oral/self-organization-of-a.pdf
http://websom.hut.fi/websom/doc/ps/Lagus04Infosci.pdf
Keep in mind that the map size is also an application-specific parameter: it depends on what you want to do with the generated clusters. Large maps produce a large number of small but "compact" clusters (records assigned to each cluster are quite similar). Small maps produce fewer but more generalized clusters. A "right number of clusters" doesn't exist, especially in real-world datasets. It all depends on the level of detail at which you want to examine your dataset.
I have written a function that, given the data set as input, returns the grid size. I rewrote it from the som_topol_struct() function of MATLAB's Self-Organizing Map Toolbox into an R function.
topology <- function(data) {
  # For a hexagonal lattice, determine the number of neurons (munits)
  # and their layout (msize).
  D <- data
  # munits: number of hexagons; dlen: number of records (subjects)
  dlen <- dim(data)[1]
  ndim <- dim(data)[2]
  munits <- ceiling(5 * dlen^0.5)  # MATLAB heuristic formula
  # munits <- 100
  # size <- c(round(sqrt(munits)), round(munits / round(sqrt(munits))))
  A <- matrix(Inf, nrow = ndim, ncol = ndim)
  # Center each column on its mean (ignoring non-finite values).
  for (i in 1:ndim) {
    D[, i] <- D[, i] - mean(D[is.finite(D[, i]), i])
  }
  # Covariance matrix, computed over finite entries only.
  for (i in 1:ndim) {
    for (j in i:ndim) {
      c <- D[, i] * D[, j]
      c <- c[is.finite(c)]
      A[i, j] <- sum(c) / length(c)
      A[j, i] <- A[i, j]
    }
  }
  # The ratio of the two largest eigenvalues determines the side ratio of the map.
  VS <- eigen(A)
  eigval <- sort(VS$values)
  if (eigval[length(eigval)] == 0 |
      eigval[length(eigval) - 1] * munits < eigval[length(eigval)]) {
    ratio <- 1
  } else {
    ratio <- sqrt(eigval[length(eigval)] / eigval[length(eigval) - 1])
  }
  size1 <- min(munits, round(sqrt(munits / ratio * sqrt(0.75))))
  size2 <- round(munits / size1)
  return(list(munits = munits, msize = sort(c(size1, size2), decreasing = TRUE)))
}
hope it helps...
Iván Vallés-Pérez
I don't have a reference for it, but I would suggest starting off by using approximately 10 SOM neurons per expected class in your dataset. For example, if you think your dataset consists of 8 separate components, go for a map with 9x9 neurons. This is completely just a ballpark heuristic though.
If you'd like the data to drive the topology of your SOM a bit more directly, try one of the SOM variants that change topology during training:
Growing SOM
Growing Neural Gas
Unfortunately these algorithms involve even more parameter tuning than plain SOM, but they might work for your application.
Kohonen has written on the issue of selecting parameters and map size for the SOM in his book "MATLAB Implementations and Applications of the Self-Organizing Map". In some cases, he suggests that the initial values can be arrived at by testing several sizes of the SOM and checking that the cluster structures are shown with sufficient resolution and statistical accuracy.
My suggestions would be the following:
SOM is distantly related to correspondence analysis. In statistics, 5*r^2 is used as a rule of thumb, where r is the number of rows/columns in a square setup.
Usually, one should use a criterion that is based on the data itself, meaning that you need some criterion for estimating homogeneity. If a certain threshold were violated, you would need more nodes. For checking homogeneity you need a certain number of records per node. Again, from statistics you can learn that for simple tests (a small number of variables) you need around 20 records, and for more advanced tests on several variables at least 8 records.
Remember that the SOM represents a predictive model, so validation is key, absolutely mandatory. Yet validation of predictive models (see the type I / type II error entry on Wikipedia) is a subject of its own. The acceptable risk as well as the risk structure also depend fully on your purpose.
You may test how the error rate of the model develops as you reduce its size more and more. Then take the smallest size with an acceptable error.
It is a strength of the SOM to allow for empty nodes. Yet there should not be too many of them; let's say less than 5%.
Taken all together, from experience I would recommend the following criterion: a minimum of 8-10 records per node in absolute terms, while nodes falling below that should not make up more than 5% of all clusters (a small sketch of this check follows below).
This 5% rule is of course a heuristic, but it can be justified by the confidence levels commonly used in statistical tests. You may choose any percentage from 1% to 5%.
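A minimal sketch of that last check, assuming you already have an array that maps each record to its winning node (the variable names are placeholders and the thresholds are the rules of thumb above):
import numpy as np

def check_map_size(node_of_record, n_nodes, min_records=8, max_weak_share=0.05):
    # Nodes with fewer than min_records records (including empty nodes)
    # should make up at most max_weak_share of all nodes.
    counts = np.bincount(node_of_record, minlength=n_nodes)
    weak_share = np.mean(counts < min_records)
    print(f"nodes below {min_records} records: {weak_share:.1%}")
    return weak_share <= max_weak_share

# Hypothetical usage: one million records mapped onto a 30 x 30 grid.
assignments = np.random.randint(0, 900, size=1_000_000)
check_map_size(assignments, n_nodes=900)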

Resources