When to use Datadog Distribution and Histogram

I cannot find any article that describes the advantages of using a Datadog histogram compared to a Datadog distribution for apps that run on multiple instances. Would someone kindly help me decide on the best choice between those two?

One of the main differences that I found in the DataDog documentation:
The HISTOGRAM metric submission type represents the statistical distribution of a set of values calculated Agent-side in one time interval. The Agent aggregates the values that are sent in a defined time interval and produces different metrics which represent the set of values.
The DISTRIBUTION metric submission type represents the global statistical distribution of a set of values calculated across your entire distributed infrastructure in one time interval. A DISTRIBUTION metric sends all the raw data during a time interval to Datadog. Aggregations occur on the server-side. Because the underlying data structure represents raw, unaggregated data, distributions provide two major features: Calculation of percentile aggregations and Customization of tagging
https://docs.datadoghq.com/metrics/types/
https://docs.datadoghq.com/metrics/distributions/

Distributions provide enhanced query functionality and configuration options compared to histograms. Since aggregation happens server-side for distribution-type metrics, you can calculate globally accurate percentiles for your services.
Histograms, on the other hand, are aggregated on the Agent side. So even if you see a .p99 time series in Datadog, it will be per host. You can aggregate it in Datadog by taking an average, but it won't be a true p99.
One use case that distributions help solve is defining your service's SLOs.
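As a concrete illustration, here is a minimal sketch of submitting the same measurement both ways with the datadog Python client (datadogpy) and a local DogStatsD agent; the metric name, value, and tags below are made up for illustration.

    # Minimal sketch: submitting the same latency value as a histogram and
    # as a distribution via DogStatsD. Metric names and tags are made up.
    from datadog import initialize, statsd

    initialize(statsd_host="127.0.0.1", statsd_port=8125)

    # HISTOGRAM: the local Agent aggregates the values it sees on this host
    # and ships per-host series (avg, max, median, count, 95percentile).
    statsd.histogram("checkout.request.latency", 0.237, tags=["service:checkout"])

    # DISTRIBUTION: raw values are forwarded and aggregated server-side, so
    # percentiles are computed across every host that reports the metric.
    statsd.distribution("checkout.request.latency.dist", 0.237, tags=["service:checkout"])

The histogram call yields per-host series computed by the local Agent, while the distribution call ships the raw values so that globally accurate percentiles such as p95 and p99 can be computed across all reporting hosts.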

Related

Dynamic clustering for panel data

I have panel data consisting of time series over 120 months for 45 institutions, with approximately 8 variables for each one. I want to do a cluster analysis in order to detect stressed institutions based on dynamic clustering, for instance to check whether a stressed institution moves from one cluster to another, or whether its behavior changes so much that it no longer fits in its own cluster.
The idea would be to use the information available up to time t to cluster the institutions, so that each institution's cluster assignment can evolve with new information, using all the information available up to that point from all the banks, with time-varying clusters.
My first idea was to use statistical control techniques and anomaly detection for time series, such as the ones in the anomaly package, but this procedure does not use the information from the other banks, just a bank's own. It might be that the whole system is stressed, so detecting an anomaly in one bank might be due to the system and not to that particular bank.
I also tried clustering in each period through hierarchical clustering, which did a decent job of classifying the institutions based on my knowledge of them. However, this procedure only uses the data at each point in time, not all the data available up to that point.
My next idea was to apply clustering methods to the panel data at each point in time, using the data up to that point and cycling through each month to get dynamic clusters from the whole dataset. However, I don't know whether this approach makes sense, or whether there are better methods for this kind of analysis.
Thank you very much!
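A rough sketch of that expanding-window idea, assuming the panel is a pandas DataFrame with a month column, an institution column, and the feature columns (all names below are hypothetical), could look like this:

    # Rough sketch of the expanding-window clustering idea described above.
    # Assumes a pandas DataFrame `panel` with columns "month", "institution"
    # and the institution-level variables; all names here are hypothetical.
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.stats import zscore

    feature_cols = ["var1", "var2", "var3"]

    assignments = {}  # month -> {institution: cluster label}
    for month in sorted(panel["month"].unique()):
        # Use all information available up to this month, summarised per
        # institution as the mean of each variable over the expanding window.
        window = panel[panel["month"] <= month]
        profile = window.groupby("institution")[feature_cols].mean().apply(zscore)

        Z = linkage(profile.values, method="ward")
        labels = fcluster(Z, t=4, criterion="maxclust")  # e.g. 4 clusters
        assignments[month] = dict(zip(profile.index, labels))

    # Cluster labels are arbitrary per month, so align them across months
    # (e.g. by maximum overlap) before flagging institutions that move.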

Probability Distribution and Data Distribution in ML

I have been reading about probability distributions lately and got confused: what actually is the difference between a probability distribution and a data distribution, or are they the same? Also, what is the importance of probability distributions in machine learning?
Thanks
Data distribution is a function or a listing that shows all the possible values (or intervals) of the data. This can help you decide if the set of data that you have is good enough to apply any techniques to it. You want to avoid skewed data.
Probability distribution is a statistical function that describes all the possible values and likelihoods that a random variable can take within a given range. This helps you decide what type of statistical methods you can apply to your data. Example: if your data follows a Gaussian distribution, then you already know what values look like when they are one standard deviation away from the mean, and what the probability is of observing a value more than one standard deviation away.
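As a quick numeric check of that Gaussian example, a minimal sketch using scipy:

    # Probability mass of a normal distribution within / beyond one standard
    # deviation of the mean; a minimal illustration of the point above.
    from scipy.stats import norm

    within_1_sigma = norm.cdf(1) - norm.cdf(-1)   # ~0.6827
    beyond_1_sigma = 1 - within_1_sigma           # ~0.3173
    print(f"P(|X - mu| <= 1 sigma) = {within_1_sigma:.4f}")
    print(f"P(|X - mu| >  1 sigma) = {beyond_1_sigma:.4f}")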
NOTE: You may want to learn about how hypothesis testing is done for ML models.

Auto-trigger for thresholds with time series data

I wonder if ClickHouse is a possible solution for the following task.
I'm collecting time-series data (for example, pulse measurements of people).
I have different types of thresholds (for example, min and max pulse values based on age).
Once a pulse for an individual person reaches the appropriate threshold, I want to trigger an external service.
In other words, what I am looking for beyond regular time-series storage is:
the ability to set multiple thresholds
automatic detection when a value is beyond a threshold
emitting some kind of event to a 3rd party
Any other tools suggestions are appreciated. Thanks in advance.
ClickHouse has partial support for this task.
You can try to write your own code (Python, Go, or anything else) as an external process,
which can use LIVE VIEW tables and the WATCH query for trigger-event detection; see this article, which describes these features:
https://www.altinity.com/blog/2019/11/13/making-data-come-to-life-with-clickhouse-live-view-tables
This code should then emit an event to the 3rd-party system.
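A rough sketch of such an external watcher process in Python, assuming the clickhouse-driver package and ClickHouse's experimental LIVE VIEW feature; the table, columns, threshold, and webhook URL are all hypothetical, and how a never-ending WATCH query streams through your particular client is something to verify:

    # Rough sketch of an external watcher process. Assumes clickhouse-driver
    # and the experimental LIVE VIEW feature; table/column names, the fixed
    # threshold, and the webhook URL are hypothetical.
    import requests
    from clickhouse_driver import Client

    client = Client("localhost", settings={"allow_experimental_live_view": 1})

    # Live view that only contains readings beyond a fixed threshold;
    # per-person thresholds could come from a join against a thresholds table.
    client.execute("""
        CREATE LIVE VIEW IF NOT EXISTS pulse_alerts AS
        SELECT person_id, ts, pulse
        FROM pulse_readings
        WHERE pulse > 180 OR pulse < 40
    """)

    # WATCH blocks and returns a new result every time the live view changes;
    # each row carries the view's columns plus a trailing _version column.
    for person_id, ts, pulse, _version in client.execute_iter("WATCH pulse_alerts"):
        # Forward the alert to the 3rd-party service.
        requests.post("https://example.invalid/alerts",
                      json={"person_id": person_id, "ts": str(ts), "pulse": pulse})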

Why is a feature good for distinguishing a cluster?

Let us suppose that we are trying to rank the importance of each feature of the dataset for each given cluster, in a clustering task. What characteristics should we measure in a feature to consider it good for characterizing a given cluster?
I am looking for a more analytical characterization of these features. For example, if a feature f has a high standard deviation in the whole dataset but a small standard deviation within a cluster c, does this mean that this feature is important for distinguishing the cluster c?
There are two approaches you could use here:
A feature selection approach would be to remove the feature in question, redo the clustering, and see whether it had a strong effect; if not, you could say this feature is unnecessary for the clustering task. The downside of this approach is the time it would take to run the clustering process for each subset of features in the dataset.
A statistical approach would be to split the data into two groups: the samples from the cluster and the rest of the samples. Then you ask how different the feature values are when comparing the two populations. Depending on the distribution of the feature, you could pick a test such as the KS test, t-test, chi-squared test, or any other test for comparing the distributions of two samples.
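A minimal sketch of that statistical approach, assuming a numeric feature matrix X (a pandas DataFrame) and an array of cluster labels coming from your own clustering pipeline:

    # Minimal sketch of the statistical approach above: compare a feature's
    # distribution inside a cluster against the rest of the data.
    # `X` (a pandas DataFrame) and `labels` (cluster assignments) are assumed
    # to come from your own clustering pipeline.
    import numpy as np
    from scipy.stats import ks_2samp

    def feature_scores_for_cluster(X, labels, cluster_id):
        in_cluster = np.asarray(labels) == cluster_id
        scores = {}
        for col in X.columns:
            inside = X.loc[in_cluster, col]
            outside = X.loc[~in_cluster, col]
            stat, p_value = ks_2samp(inside, outside)
            # A large statistic (tiny p-value) suggests the feature separates
            # this cluster from the rest; rank features by the statistic.
            scores[col] = (stat, p_value)
        return scores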

Feature weightage from Azure Machine Learning Deployed Web Service

I am trying to make predictions from my past data, which has around 20 attribute columns and a label. Out of those 20, only 4 are significant for prediction. But I also want to know, when a row falls into one of the classified categories, which other important correlated columns there are apart from those 4 and what their weights are. I want to get that result from my deployed web service on Azure.
You can use the Permutation Feature Importance module, but that will give the importance of the features across the whole sample set. Retrieving the weights on a per-call basis is not available in Azure ML.
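For reference, the same kind of global (not per-call) ranking can be reproduced offline with scikit-learn's permutation_importance; this is only a rough local analogue, not the Azure web service API, and X, y, and the choice of model are placeholders:

    # Rough offline analogue of the Permutation Feature Importance module,
    # using scikit-learn. X (a DataFrame of the ~20 attributes) and y (the
    # label) are assumed to exist; the model choice is just an example.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
    # Print features from most to least important across the sample set.
    for i in result.importances_mean.argsort()[::-1]:
        print(X.columns[i], result.importances_mean[i])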
