We have a machine which sends its status values (0, 1, or 2) at irregular intervals, and we are storing them in Graphite. The status values mean:
0 - stopped
1 - working
2 - stopped by anomaly
The requested KPIs to extract are the classical ones: how much time was spent in status 0, 1, or 2 in a day or a week? Before reinventing the wheel, we are looking for the best way to compute those KPIs, and whether Graphite (or possibly another time-series solution) already has functions which deal with summing the time during which the data point value satisfies a condition. Clearly the time intervals to sum are not stored; each interval is the time elapsed between a data point and the next one.
Or should the data be pre-processed to compute the time intervals, and then stored as three data sets like status.working, status.stopped, and status.alarm, where for each one we store when the specific "event" started and how long it lasted?
There are other KPIs, for example the number of alarms in a day. Receiving two status data points in a row that both indicate status "2" is actually a single alarm condition and must count as 1.
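For illustration, here is a minimal sketch of the pre-processing we have in mind (Python; the point format and the status codes are as described above, everything else is made up for the example):

from collections import defaultdict

def compute_kpis(points):
    # points: list of (timestamp_in_seconds, status) sorted by time.
    # Each interval between a point and the next one is attributed
    # to the status reported by the earlier point.
    seconds_per_status = defaultdict(float)
    alarm_count = 0
    prev_status = None
    for i, (ts, status) in enumerate(points):
        if i + 1 < len(points):
            seconds_per_status[status] += points[i + 1][0] - ts
        # consecutive status-2 points belong to one alarm condition
        if status == 2 and prev_status != 2:
            alarm_count += 1
        prev_status = status
    return dict(seconds_per_status), alarm_count

# 0 = stopped, 1 = working, 2 = stopped by anomaly
points = [(0, 1), (600, 2), (900, 2), (1500, 1), (3600, 0)]
print(compute_kpis(points))  # -> ({1: 2700.0, 2: 900.0}, 1)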
So, is there a good way to store such data without pre-processing it? It sounds like a common pattern, but (shame on us?) we have not found this topic well explored.
Thanks.
Graphite has a number of functions that could help you here. One that stands out is the summarize() function, to which you can pass an aggregation method (in this case sum) and a duration (in minutes/hours/days/weeks, etc.); take a look at the function documentation.
isNonNull is another useful function: it can be used to determine the existence of a datapoint regardless of the value.
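For example, to count the data points reported with status 2 per day, something along these lines might work (machine.status is a placeholder metric name here; removeAboveValue/removeBelowValue keep only the points equal to 2, isNonNull maps them to 1, and summarize sums them per day):

summarize(isNonNull(removeBelowValue(removeAboveValue(machine.status, 2), 2)), "1d", "sum")

Note that since your machine reports irregularly, a count of data points is not the same as time spent in that status, so for the time-based KPIs the pre-processing you describe may still be necessary.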
When you say that the machine reports a value 0 to indicate it has stopped - does it actually send that value, or does it report nothing? This is an important detail and will have some bearing on the end result of your solution.
I have a high-throughput system. I found out that since many events have the same timestamp, Influx has overwritten many events.
Therefore I tried moving from milliseconds to nanoseconds, but since I am using Java, I could not get real clock-based nanoseconds.
I came up with this solution:
I created a new tag called "descriptor"; for each event I insert a random number between 1 and 1000 into it. These values are fixed, and the probability of the same timestamp occurring with the same random descriptor value is very low. This fixes my problem and I can see all the events.
My question is whether it is OK to use these 1000 values - since this is a tag, I understand it can mess up my index and my performance?
Regards, Ido
As the random "descriptors" are completely uncorrelated to other event tags, in the worst case this could increase your series cardinality by 3 orders of magnitude. This is because each existing series (s) will potentially split into up to 1000 unique series (s,1),(s,2),...,(s,1000).
How much of a problem this is will depend on your existing series cardinality. Increasing from 10 to 10,000 is probably no big deal. Increasing from 100,000 to 100,000,000 is more likely to be an issue. You would need to experiment and profile to see.
An alternative approach might be to encode the "descriptor" in the microsecond and/or nanosecond component(s) of the timestamp (as you're not using them anyway) to make them unique.
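A minimal sketch of that idea (Python used for brevity; the same arithmetic applies in Java, and nanosecond write precision is an assumption about your setup):

import random
import time

def unique_ns_timestamp():
    # Epoch-millisecond clock reading, promoted to nanoseconds,
    # with a random offset encoded in the otherwise-unused
    # sub-millisecond digits. A collision now requires both the
    # same millisecond and the same random offset.
    millis = int(time.time() * 1000)
    return millis * 1_000_000 + random.randrange(1_000_000)

print(unique_ns_timestamp())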
I am new to Prometheus and Grafana. My primary goal is to get the response time per request.
To me it seemed to be a simple thing - but whatever I do, I do not get the results I require.
I need to be able to analyse the service latency in the last minutes/hours/days. The current implementation I found was a simple SUMMARY (without definition of quantiles) which is scraped every 15s.
Is it possible to get the average request latency of the last minute from my Prometheus SUMMARY?
If YES: How? If NO: What should I do?
Currently I am using the following query:
rate(http_response_time_sum{application="myapp",handler="myHandler", status="200"}[1m])
/
rate(http_response_time_count{application="myapp",handler="myHandler", status="200"}[1m])
I am getting two "datasets". The value of the first is "NaN"; I suppose this is the result of a division by zero.
(I am using spring-client).
Your query is correct. The result will be NaN if there have been no requests in the past minute.
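If the NaN samples get in the way, one common idiom is to filter out zero denominators with a comparison, e.g. something like:

rate(http_response_time_sum{application="myapp",handler="myHandler", status="200"}[1m])
/
(rate(http_response_time_count{application="myapp",handler="myHandler", status="200"}[1m]) > 0)

The > 0 filter drops the zero-rate samples entirely, so for idle minutes the division simply returns no data instead of NaN.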
Suppose there is a data set of statistical data with a number of input columns and one output column. The predictors characterize some particular process that is repeated, so one data row corresponds to one occurrence of that process. For these process characteristics, the order and duration are important. Some of them might be absent altogether; some of them are repeated, but with a different speed or another parameter.
Let's say that our process is named P and it can have a number of child parts that together form the process. Let's say that, on one occasion, the process had N sub-processes:
Sub-process 1, with: speed = SpdA, duration = DurA, depth = DepA
Right after sub-process A, the next sub-process B happened:
Sub-process 2, with: speed = SpdB, duration = DurB, depth = DepB
...
... N. Sub-process N.
So there might be from 1 to N child processes in each process, that is, in each data row, and the number of child processes may vary from one row to another. That is the input data.
As for the output: in the simplest case it is binary - either success or failure - but in reality it will be a non-negative number ranging from 0 to positive infinity. This number represents the time by which the process finished successfully. If the value of the output is positive infinity, it means that the process failed to succeed.
A very important note: if we go with the simplest case, where the output is binary, most of the data rows in the statistical data set will have failure in the output. The goal is to find the hypothetical values that the test predictors should be equal to in order to make the process succeed.
For example, after learning, we should be able to tell which concrete input parameter values will most likely lead to process success. That was the simplest, binary-output case.
However, in real life we will have an output that represents the time by which the process finished successfully, and +infinity if it failed. Here the goal is the same: make the process succeed, or get as close to success as possible. The goal is to generate test inputs that we might use in the future to prevent an output equal to +infinity.
The maximum goal is, given a target time, to find the exact input values that will make the process finish successfully, as close to the given time as possible. Here we should expect the enumeration of child processes, their order, and the values for each child process to be predicted.
Here in this problem, I guess, the output will play the role of the input and the input will play the role of the output.
What is the approach to solving these problems? How can one handle the variable number of characteristics, and how can one handle the order, which might vary from one data row to another?
I am a novice in machine learning and would appreciate concrete suggestions or examples of similar problems being solved.
Any help and advice welcome!
Can anyone please explain clearly why the time of a signal is an independent variable while the amplitude is a dependent one? I looked at some results from Google but I could not figure it out.
The raw signal, whatever it is measuring, is a function of time (it is in the "time domain"), which means that if we plot it we get one axis for the time (t), which is independent, and another axis for the amplitude (x(t)), which is a variable dependent on the time.
Note that the independent variable "time" can be continuous or discrete. Continuous means the time can be represented as intervals, e.g. t = (0 -> 800), while a discrete-time signal can be represented as a countable set, e.g. t = (1/2, 5/2, 8/2).
Also, if you have a signal with more than one independent variable (not just the time), then that signal is multidimensional ("more than one dimension").
Strange question. Definitely more philosophical than programming-related. Here's my view.
One explanation is that a signal is a (mathematical) function of time. That means that for each time you have one and only one amplitude value. In contrast, the same amplitude value could be found at several (or no) time instants. So if you considered amplitude as the independent variable and time as dependent on amplitude, the relationship wouldn't be a function. It's easier to ask something whose answer is known to be unique (the amplitude obtained at a given time) than it is to ask something that might have none, one, or arbitrarily many answers (the time instants corresponding to a given amplitude level).
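As a concrete illustration of this asymmetry (the sine signal is just an arbitrary example):

x(t) = sin(t)
x(0) = 0                                   (exactly one amplitude for a given instant)
{ t : sin(t) = 0 } = { k*pi, k integer }   (infinitely many instants for a given amplitude)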
Also, psychologically we are more often interested in finding out "what the signal value is at a given instant", as opposed to knowing "at which instants a given signal value is found". For example, questions of the type "what will the weather be like tomorrow?" are more common than "on which days from now on will the weather be sunny?". So the point of view of time as independent and amplitude as dependent on time seems more natural.
Time is a universal independent variable because nothing can change the time. At multiple time instants there can be the same value of amplitude, but one time instant cannot have two amplitudes. Independent variables are those which cannot be changed with respect to another parameter.
I have a 2-part question regarding downsampling on OpenTSDB.
The first is: I was wondering if anyone knows whether OpenTSDB treats the last end point as inclusive or exclusive when it calculates downsampling, or whether it counts the end data point twice.
For example, if my time interval is 12:30pm-1:30pm, I get DPs every 5 min starting at 12:29:44pm, and my downsample interval is summing every 10-minute block, does the system take the DPs from 12:30-12:39 and sum them, then 12:40-12:49 and sum them, etc., or does it take the DPs from 12:30-12:40, then from 12:40-12:50, etc.? Yes, I know my data is off by 15 seconds, but I don't control that.
I've tried to calculate it by hand, but the data I have isn't helping me. The numbers I'm calculating aren't adding up to the above, nor are they matching what the graph is showing. I don't have access to the system that's pushing numbers into OpenTSDB, so I can't set up dummy data to check.
The second question is: how does downsampling plot its points on the graph, given my time range and downsample interval? I set downsampling to sum 10-minute blocks. I set my range to be 12:30pm-1:30pm. The graph shows the first point of the downsampled graph at 12:35pm. That makes logical sense. I then changed the range to be 12:24pm-1:29pm and expected the first point to start at 12:30, but the first point shown is at 12:25pm.
Hopefully someone can answer these questions for me. In the meantime, I'll continue trying to find some data in my system that helps show/prove how downsampling should work.
Thanks in advance for your help.
Downsampling isn't currently working the way you expect, although since this is a reasonable and commonly made expectation, we are thinking of changing this in a later release of OpenTSDB.
You're assuming that if you ask for a "10 min sum", the data points will be summed up within each "round" (or "aligned") 10 minute block (e.g. 12:30-12:39 then 12:40-12:49 in your example), but that's not what happens. What happens is that the code will start a 10-minute block from whichever data point is the first one it finds. So if the first one is at time 12:29:44, then the code will sum all subsequent data points until 600 seconds later, meaning until 12:39:44.
Within each 600 second block, there may be a varying number of data points. Some blocks may have more data points than others. Some blocks may have unevenly spaced data points, e.g. maybe all the data points are within one second of each other at the beginning of the 600s block. So in order to decide what timestamp will result from the downsampling operation, the code uses the average timestamp of all the data points of the block.
So if your data points are evenly spaced throughout the 600s block, the average timestamp will fall somewhere in the middle of the block. But if, say, all the data points are within one second of each other at the beginning of the 600s block, then the timestamp returned will reflect that by virtue of being an average. Just to be clear, the code takes the average of the timestamps regardless of which downsampling function you picked (sum, min, max, average, etc.).
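To make that concrete, here is a small simulation of the algorithm as described above (Python; this is not OpenTSDB's actual code, just a sketch of the same logic):

def downsample(points, interval, agg=sum):
    # points: list of (timestamp_in_seconds, value) sorted by time.
    # Blocks start at the first data point seen, not at aligned
    # boundaries; the output timestamp is the average timestamp
    # of the block, whatever the aggregation function is.
    out = []
    i = 0
    while i < len(points):
        block_start = points[i][0]
        block = []
        while i < len(points) and points[i][0] < block_start + interval:
            block.append(points[i])
            i += 1
        avg_ts = sum(ts for ts, _ in block) / len(block)
        out.append((avg_ts, agg(v for _, v in block)))
    return out

# DPs every 5 min starting at :44 seconds past the minute
points = [(44 + 300 * k, 1.0) for k in range(6)]
print(downsample(points, 600))
# blocks cover 44-643, 644-1243, 1244-1843 -- anchored to the first DP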
If you want to experiment quickly with OpenTSDB without writing to your production system, consider setting up a single-node OpenTSDB instance. It's very easy to do as is shown in the getting started guide.