Time series database that computes integrals

I have a data set that contains the following information:
Device # | Timestamp | In Error (1=yes, 0=no)
1 | 1459972740 | 1
1 | 1459972745 | 1
1 | 1459972750 | 0
1 | 1459972755 | 1
2 | 1459972740 | 0
2 | 1459972745 | 1
2 | 1459972750 | 1
2 | 1459972755 | 1
...
I would like to compute the number of minutes a device has been in error over a specific period, e.g. "How much downtime (in minutes) did we have per device yesterday?". That would lead to "Which device had the most downtime yesterday?", "What is our average error time per device per day?", etc.
I would assume that this is a classic use case for time series, but I can't find any product that can compute an integral aggregation over this dataset. Note that the engine must be able to infer a value from the previous snapshot. In my example, if I request the downtime per device between 1459972742 and 1459972752, the output should be 8 seconds for device #1 and 7 seconds for device #2.
Thanks!

VictoriaMetrics provides the integrate() function, which can be used for calculating the integral over a given duration. For example, the following MetricsQL query calculates the integral of the given metric over the last 24 hours:
integrate(metric[24h])
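For illustration, here is a minimal standalone sketch (plain Python, not tied to VictoriaMetrics or any other product) of the previous-snapshot integration described in the question; it reproduces the 8-second / 7-second example:

# Step-function ("previous snapshot") integration of a 0/1 error series.
# Each sample is assumed to hold until the next sample; the result is the
# number of seconds the value was 1 within [start, end].
def downtime_seconds(samples, start, end):
    # samples: list of (timestamp, value) pairs sorted by timestamp
    total = 0
    for i, (ts, value) in enumerate(samples):
        seg_start = max(ts, start)
        seg_end = samples[i + 1][0] if i + 1 < len(samples) else end
        seg_end = min(seg_end, end)
        if seg_end > seg_start:
            total += value * (seg_end - seg_start)
    return total

device1 = [(1459972740, 1), (1459972745, 1), (1459972750, 0), (1459972755, 1)]
device2 = [(1459972740, 0), (1459972745, 1), (1459972750, 1), (1459972755, 1)]
print(downtime_seconds(device1, 1459972742, 1459972752))  # 8 seconds
print(downtime_seconds(device2, 1459972742, 1459972752))  # 7 seconds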

Axibase Time Series Database provides both an API and visualization for threshold aggregation functions that can compute SLA/outage metrics.
See also: Grafana: How to have the duration for a selected period
In your case the threshold would be 1:
"threshold": {
  "max": 1
}

Related

Machine learning model with different inputs

I have a dataset that consists of: date, id (the id of the event), number_of_activities, and running_sum (the running sum of activities by id).
This is a part of my data:
date | id (id of the event) | number_of_activities | running_sum |
2017-01-06 | 156 | 1 | 1 |
2017-04-26 | 156 | 1 | 2 |
2017-07-04 | 156 | 2 | 4 |
2017-01-19 | 175 | 1 | 1 |
2017-03-17 | 175 | 3 | 4 |
2017-04-27 | 221 | 3 | 3 |
2017-05-05 | 221 | 7 | 10 |
2017-05-09 | 221 | 10 | 20 |
2017-05-19 | 221 | 1 | 21 |
2017-09-03 | 221 | 2 | 23 |
My goal is to predict the future number of activities for a given event. My question: can I train my model on the whole dataset (all the events) to predict the next one, and if so, how? The number of inputs is unequal (the number of rows per event differs). And is it possible to exploit the date data as well?
Sure you can. But a lot more information is needed, which you yourself know best.
I guess we are talking about time series here, since you want to predict the future.
You might want to have a look at recurrent neural nets and LSTMs:
A recurrent layer takes a time series as input and outputs a vector that contains compressed information about the whole series. So let's take event 156, which has 3 timesteps:
The event is your sample with 3 timesteps. Each timestep can have a different number of activities (features). To solve this, just use the maximum number of features that occurs and add a padding value (most often simply zero) so that all sequences have the same length. Then you have a shape that is suitable for a recurrent neural net (where LSTMs are currently a good choice), as in the sketch below.
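A minimal sketch of the padding + LSTM idea, assuming TensorFlow/Keras; the sequences, target values, layer sizes and training settings are illustrative placeholders, not taken from the question:

# Illustrative sketch of the padding + LSTM idea (assumes TensorFlow/Keras);
# sequence values, targets and shapes are made up for demonstration.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences

# One sequence of activity counts per event, with different lengths.
sequences = [[1, 1, 2],          # event 156
             [1, 3],             # event 175
             [3, 7, 10, 1, 2]]   # event 221
targets = np.array([2.0, 4.0, 3.0])  # hypothetical "next number of activities"

max_len = max(len(s) for s in sequences)
X = pad_sequences(sequences, maxlen=max_len, padding='post',
                  dtype='float32')   # shape: (events, timesteps)
X = X[..., np.newaxis]               # shape: (events, timesteps, 1 feature)

model = Sequential([
    Masking(mask_value=0.0, input_shape=(max_len, 1)),  # ignore the padding
    LSTM(32),
    Dense(1)                                            # predict the next activity count
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, targets, epochs=10, verbose=0)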
Update
You said in the comments that using padding is not an option for you; let me try to convince you. LSTMs are good in situations where the sequence lengths differ. However, for this to work you also need longer sequences from which the model can learn its patterns. What I want to say is: when some of your sequences have only a few timesteps, like 3, but others have 50 or more, the model might have difficulty predicting these correctly, as you have to specify which timestep you want to use. So either you prepare your data differently for a clearer question, or you dig deeper into the topic using sequence-to-sequence learning, which is very good at handling sequences of different lengths. For this you will need to set up an encoder-decoder network.
The encoder squashes the whole sequence, whatever its length, into one vector. This single vector is compressed in such a way that it contains the information of the whole sequence.
The decoder then learns to use this vector to predict the next outputs of the sequence. This is a well-known technique for machine translation, but it is suitable for any kind of sequence-to-sequence task. So I would recommend you create such an encoder-decoder network, which will surely improve your results. Have a look at this tutorial, which might help you further.
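For reference, a rough sketch of what such an LSTM encoder-decoder (seq2seq) model can look like in Keras; the latent size, feature count and teacher-forcing style decoder input are assumptions for illustration, not a prescription:

# Rough sketch of an LSTM encoder-decoder (seq2seq) in Keras; layer sizes,
# sequence lengths and feature counts are placeholders, not from the question.
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

n_features, latent_dim = 1, 64

# Encoder: squash the whole (variable-length) input sequence into state vectors.
encoder_inputs = Input(shape=(None, n_features))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: use the encoder state to predict the output sequence step by step.
decoder_inputs = Input(shape=(None, n_features))
decoder_lstm = LSTM(latent_dim, return_sequences=True)
decoder_outputs = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = Dense(n_features)(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='mse')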

Description matching in record linkage using a machine learning approach

We are working on a record linkage project.
In simple terms, we are searching for a product in a database just by looking at the similarity of descriptions. It is a very interesting problem to solve, but the machine learning approach we have adopted is currently resulting in very low accuracy. If you can suggest a lateral approach, it would help our project a lot.
Input description
+----+-----------------------------------------------+
| ID | description                                   |
+----+-----------------------------------------------+
| 1  | delta t17267-ss ara 17 series shower trim ss  |
| 2  | delta t14438 chrome lahara tub shower trim on |
| 3  | delta t14459 trinsic tub/shower trim          |
| 4  | delta t17497 cp cassidy tub/shower trim only  |
| 5  | delta t14497-rblhp cassidy tub & shower trim  |
| 6  | delta t17497-ss cassidy 17 series tub/shower  |
+----+-----------------------------------------------+
Description in Database
+----+--------------------------------------------------------------------------------------------------+
| ID | description                                                                                      |
+----+--------------------------------------------------------------------------------------------------+
| 1  | delta monitor17 ara® shower trim 2 gpm 1 lever handle stainless commercial                      |
| 2  | delta monitor 14 lahara® tub and shower trim 2 gpm 1 handle chrome plated residential           |
| 3  | delta monitor 14 trinsic® tub and shower trim 2 gpm 1 handle chrome plated residential          |
| 4  | delta monitor17 addison™ tub and shower trim 2 gpm 1 handle chrome plated domestic residential  |
| 5  | delta monitor 14 cassidy™ tub and shower trim 2 gpm venetian bronze                             |
| 6  | delta monitor 17 addison™ tub and shower trim 2 gpm 1 handle stainless domestic residential     |
+----+--------------------------------------------------------------------------------------------------+
Background information
1. The records in the database are fundamentally very similar to each other, which causes a huge issue.
2. There are around 2 million records in the database, but when we search within a specific manufacturer the search space is reduced to a few hundred records.
3. The record in "Input description" with ID 1 is the same as the record in "Description in Database" with ID 1 (we know that from manual matching).
4. We use a random forest to train and predict.
Current approach
1. We tokenized the descriptions.
2. Removed stopwords.
3. Added abbreviation information.
4. For each record pair we calculate scores from different string metrics (Jaccard, Sørensen-Dice, cosine), and the average of all these scores is calculated.
5. Then we calculate the score for the manufacturer ID using the Jaro-Winkler metric.
6. So if there are 5 records of a manufacturer in "Input description" and 10 records for that manufacturer in the database, the total is 50 record pairs, i.e. 10 pairs per record, which results in scores that are very close to each other. We considered the top 4 record pairs from each set of 10 pairs. Where more than one record pair had a similar score, we considered all of them.
7. We arrive at the following learning data set format:
+---------+---------------------------+-----------------------+-------------------------------+---------------+------------+
| IsMatch | Description average score | Manufacturer ID score | Jaccard score of description  | Sørensen-Dice | Cosine (3) |
+---------+---------------------------+-----------------------+-------------------------------+---------------+------------+
| 1       | 0.19                      | 0.88                  | 0.12                          | 0.21          | 0.23       |
| 0       | 0.14                      | 0.66                  | 0.08                          | 0.16          | 0.17       |
| 0       | 0.14                      | 0.68                  | 0.08                          | 0.15          | 0.19       |
| 0       | 0.14                      | 0.58                  | 0.08                          | 0.16          | 0.16       |
| 0       | 0.12                      | 0.55                  | 0.08                          | 0.14          | 0.14       |
+---------+---------------------------+-----------------------+-------------------------------+---------------+------------+
We train on the above dataset. When we predict in real time using the same approach, the accuracy is very low.
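For reference, a simplified sketch of how the pairwise token-set metrics above (Jaccard, Sørensen-Dice, cosine) and their average can be computed; the tokenizer and the example strings are illustrative only, not the actual pipeline:

# Simplified token-set string metrics for one record pair; illustration only.
import math

def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def sorensen_dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def cosine(a, b):
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

def pair_features(input_desc, db_desc):
    a, b = tokens(input_desc), tokens(db_desc)
    scores = [jaccard(a, b), sorensen_dice(a, b), cosine(a, b)]
    return scores + [sum(scores) / len(scores)]  # individual metrics + average

print(pair_features(
    "delta t17267-ss ara 17 series shower trim ss",
    "delta monitor17 ara shower trim 2 gpm 1 lever handle stainless commercial"))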
Please suggest any alternative approach.
We planned to use TF-IDF, but an initial investigation suggests it may not improve the accuracy by much either.

Is there a Feature Outline?

In my application, the sets of tests for an Estimate and an Invoice are very similar. I can use a Scenario Outline and Examples to repeat a test for these types. But how do I repeat all the tests within a feature with examples, without repeating the examples at every scenario outline?
For example, is there a way I can rewrite the tests below without having to state the examples twice?
Scenario Outline: Adding a sales line item
  Given I have a <Transaction>
  And Add Hours of quantity 2 and rate 3
  When I save
  Then the total is 6

  Examples:
    | Transaction |
    | Invoice     |
    | Estimate    |

Scenario Outline: Adding two sales line items
  Given I have a <Transaction>
  And Add Hours of quantity 2 and rate 3
  And Add Hours of quantity 5 and rate 2
  When I save
  Then the total is 16

  Examples:
    | Transaction |
    | Invoice     |
    | Estimate    |
In other words, is there such a thing as, for lack of a better term, a "Feature Outline"?
Unfortunately, the Gherkin language does not support anything like this.

Measuring the periodicity strength of a specific time in time series data

I am trying to measure the periodicity strength of a specific time in time series data when a period (e.g., 1 day, 7 days) is given.
For example,
      | AM 10:00 | 10:30 | 11:00 |
DAY 1 |    A     |   A   |   B   |
DAY 2 |    A     |   B   |   B   |
DAY 3 |    A     |   B   |   B   |
DAY 4 |    A     |   A   |   B   |
DAY 5 |    A     |   A   |   B   |
If the period is 1 day, then AM 10:00 and AM 11:00 have the highest periodicity strength in this data, because the values at those times are consistent across days.
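For concreteness, here is a tiny sketch of the kind of consistency score I have in mind (a simple mode-frequency heuristic over the table above, purely illustrative, not an established method):

# Score each time slot by how often its most frequent value occurs across
# periods, so perfectly consistent slots (10:00 and 11:00 above) score 1.0.
from collections import Counter

data = {
    "10:00": ["A", "A", "A", "A", "A"],
    "10:30": ["A", "B", "B", "A", "A"],
    "11:00": ["B", "B", "B", "B", "B"],
}

def periodicity_strength(values):
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values)

for slot, values in data.items():
    print(slot, periodicity_strength(values))
# 10:00 -> 1.0, 10:30 -> 0.6, 11:00 -> 1.0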
Are there any popular methods or research for doing this?
There is a lot of existing research on finding periodic patterns in time series, but I can't find research on measuring the periodicity strength of a specific time when the period is given.
Please share your knowledge. Thanks.
What you are looking for is something called cyclic association rules. I've linked to the paper that was originally written by researchers at Bell Labs.

Backpropagation: when to update weights?

Could you please help me with a neural network?
If I have an arbitrary dataset:
+---+---------+---------+--------------+--------------+--------------+--------------+
| i | Input 1 | Input 2 | Exp.Output 1 | Exp.Output 2 | Act.output 1 | Act.output 2 |
+---+---------+---------+--------------+--------------+--------------+--------------+
| 1 | 0.1 | 0.2 | 1 | 2 | 2 | 4 |
| 2 | 0.3 | 0.8 | 3 | 5 | 8 | 10 |
+---+---------+---------+--------------+--------------+--------------+--------------+
Let's say I have x hidden layers with different numbers of neurons and different types of activation functions each.
When running backpropagation (especially iRprop+), when do I update the weights? Do I update them after calculating each line from the dataset?
I've read that batch learning is often not as efficient as "on-line" training. That means that it is better to update the weights after each line, right?
And do I understand it correctly: an epoch is when you have looped through each line in the input dataset? If so, that would mean that in one epoch, the weights will be updated twice?
Then, where does the total network error (see below) come into play?
[image: total network error formula, from here.]
tl;dr:
Please help me understand how backprop works.
Typically, you would update the weights after each example in the data set (I assume that's what you mean by each line). So, for each example, you would see what the neural network thinks the output should be (storing the outputs of each neuron in the process) and then backpropagate the error. So, starting with the final output, compare the ANN's output with the expected output (what the data set says it should be) and update the weights according to a learning rate.
The learning rate should be a small constant, since you are correcting weights for each and every example. And an epoch is one iteration through every example in the data set.
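For illustration, a tiny sketch of per-example ("online") updates on a single linear layer with two inputs and two outputs, mirroring the two-row dataset above; this uses plain gradient descent with a fixed learning rate, not iRprop+, and all numbers are made up:

# Per-example ("online") weight updates for a single linear layer; the
# gradient for output = x @ W + b with squared error is outer(x, error).
import numpy as np

dataset = [  # (inputs, expected outputs)
    (np.array([0.1, 0.2]), np.array([1.0, 2.0])),
    (np.array([0.3, 0.8]), np.array([3.0, 5.0])),
]

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2))   # 2 inputs -> 2 outputs
b = np.zeros(2)
learning_rate = 0.1

for epoch in range(100):                 # one epoch = one pass over all rows
    total_error = 0.0
    for x, target in dataset:            # weights are updated after EACH row
        output = x @ W + b               # forward pass
        error = output - target
        total_error += 0.5 * np.sum(error ** 2)
        W -= learning_rate * np.outer(x, error)   # backprop step for this layer
        b -= learning_rate * error
    # total_error, summed over the epoch, is the "total network error"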
