How to learn variable time interval events with lstm - time-series

I'm a little confused about how to structure my model for classifying events that occur at variable time intervals with an LSTM.
I'm trying to classify events along the time dimension. The events occur at different intervals (e.g., Event1 and Event2 are 3 seconds apart, Event2 and Event3 are 1.68 seconds apart). Each event has the same number of features describing its characteristics.
The confusing part is that the intervals between these events are also critical for the classification problem. For example:
1) if Event1 and Event2 both have feature x = 0, and the interval is less than 1 second, this series of events would be classified as class 1;
2) if Event1, Event2 and Event3 all have feature x = 1, and the intervals are longer than 2 seconds, this series of events would be classified as class 2.
I searched through a lot of questions about RNN time series problems and found nothing helpful. Does anyone have good suggestions on how to format my data and model to deal with this problem?
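One possible way to format such data, offered only as a sketch and not as an established answer, is to append the time since the previous event to each event's feature vector, so the network sees the interval as an ordinary input, and then pad the series to a common length with a mask. The function names add_intervals and pad are hypothetical, and the example timestamps are made up to match the intervals mentioned above.

```python
def add_intervals(timestamps, features):
    """Return one row per event: its features plus the time since the previous event."""
    rows = []
    previous = timestamps[0]
    for t, feats in zip(timestamps, features):
        rows.append(list(feats) + [t - previous])  # the first event gets interval 0
        previous = t
    return rows


def pad(sequences, pad_value=0.0):
    """Right-pad variable-length sequences and return a mask marking the real events."""
    max_len = max(len(s) for s in sequences)
    width = len(sequences[0][0])
    padded, mask = [], []
    for s in sequences:
        missing = max_len - len(s)
        padded.append(s + [[pad_value] * width for _ in range(missing)])
        mask.append([1] * len(s) + [0] * missing)
    return padded, mask


# Timestamps in seconds; each event carries a single feature x.
seq_a = add_intervals([0.0, 0.8], [[0], [0]])
seq_b = add_intervals([0.0, 3.0, 4.68], [[1], [1], [1]])
padded, mask = pad([seq_a, seq_b])
print(mask)  # [[1, 1, 0], [1, 1, 1]]
```

The padded array has shape (series, events, features + 1), which is the layout most LSTM implementations expect; the mask lets the model ignore the padding.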

Related

Google Vertex Auto-ML Forecast every 30 mins predict 2 hours

GOAL
Every 30 mins I get a new bunch of price related data:
CurrentDatetime, CurrentPrice, Feature1, Feature2
I want to predict the price in 2 hours from now, so 4x30mins (4 steps into the future)
PROBLEM DESCRIPTION
I am puzzled about what Google Vertex AutoML forecasting is doing and whether I can trust the results I am getting. I am also unsure how to use a trained model for batch prediction.
WHAT I DID
I think the way to set up the training dataset is to add:
TargetDateTime column (2 hours ahead of CurrentDatetime)
TargetActualPrice column (the actual price 2 hours into the future)
TimeSeriesId column (always equal to 1 as all the data is one time series).
This means, every 30 mins I now have:
CurrentDatetime, CurrentPrice, Feature1, Feature2, TargetDateTime, TargetActualPrice, TimeSeriesId
I use this dataset to train an auto-ml forecast model, setting:
"series Identifier Column" to TimeSeriesId,
"Target Column" to TargetActualPrice,
"Timestamp Column" to TargetDateTime
"Data granularity" to 30mins
"Forecast Horizon" to 4
"Context Window" to 4 (use last 2 hours of historic data to predict next 2 hours)
Split train/val/test chronologically (on TargetDateTime, as it is the Timestamp column)
This model trains and gives some results.
Looking at the saved test data set, I can see 4 rows for each TargetDateTime, with a predictedvalue column containing a price prediction and a predicted_on_TargetDateTime column which goes from CurrentDateTime to TargetDateTime in 30 mins intervals.
This makes sense, for every 30 mins of input data, the model makes 4 predictions, each 30 mins into the future, ending up with a prediction 2 hours into the future. Great.
PROBLEM 1 : Batch predictions
I get confused when I try to use this trained model to make batch predictions. The crux of the problem is that Vertex will look at the batch input dataset, find the first row (30 min input data) for which there is no actual price data yet (TargetActualPrice is null) and then predict the next 4 steps (2 hours). This seems to mean that, to make the next prediction, I would need to wait for the actuals of the previous prediction. But that means that when I get the next set of input data (30 mins later, and 1.5 hours before the previous prediction's target time), I cannot use the model to make a new prediction because the previous prediction has no TargetActualPrice yet.
To make it more explicit, suppose I have the following batch data:
CurrentDatetime  CurrentPrice  Feature1  Feature2  TargetDateTime  TargetActualPrice  TimeSeriesId
11:00            $2.1          3.4       abc       13:00           $2.4               1
11:30            $2.2          3.3       abd       13:30           $2.5               1
12:00            $2.3          3.1       abe       14:00           $2.6               1
12:30            $2.3          3.0       abe       14:30           $2.7               1
13:00            $2.4          2.9       abf       15:00           null               1
13:30            $2.5          2.8       abg       15:30           null               1
14:00            $2.6          2.7       abh       16:00           null               1
14:30            $2.7          2.6       abi       16:30           null               1
In the batch data above, I have 2 hours (4 rows) of historic data with actuals (11:00-12:30). Current time is 14:30 so I don't have actuals for 15:00 yet. The last prediction made was with the 13:00 input data (as it is the first row with actual data = null). The 13:30 - 14:30 rows I cannot use for a new prediction until I have the 15:00 actuals.
This doesn't make sense to me. Shouldn't I be able to make a new 2 hour (4 step) prediction every 30 mins? Am I doing something wrong?
Is the solution that, when I get the next 30 mins of input data, I put the last predicted value into the actuals column (and replace it with the real actual once I have it) so that I can proceed with the next prediction? That seems cumbersome.
PROBLEM 2 : Leakage
My other concern with this is how Vertex is training and calculating the results. I am worried that when (during training) Vertex picks up the next row of 30 mins data, it will create a prediction based on the previous 4x30 mins of data (2 hour "Context window") INCLUDING the TargetActualPrice data for those rows. But this would be incorrect, as the TargetActualPrice value is 2 hours into the future and not yet available when the next 30 mins of data comes in. This would mean leakage of actual data, predicting using actuals before they are known (ie cheating).
SUMMARY
In summary, I am hoping someone can tell if I am setting the dataset up incorrectly, and/or how to batch predict every 30 mins.
With regards to my leakage concern, it originally came about because I didn't understand the batch predictions, and it seemed that every 30 mins I needed the price from 2 hours in the future in order to update the actuals of the previous prediction and then create a new prediction, which is obviously unknown at that moment.
Now my understanding is that during training, even though every 30 min timestep has an actual price column for 2 hours in the future, the model will only use actual prices available at the moment of prediction. If making a prediction at 14:30, the model will use the 14:30 data (and historic data before 14:30), and then make 4x30 mins predictions. These four predictions neither use nor are influenced by the 30 mins data after 14:30 (15:00, 15:30, 16:00, 16:30). Nor does the model use the 16:30 actual price value, which is the target column on the 14:30 row of data.
I think I now understand the batch predictions.
Every 30 mins I get a new set of data (Feature1 and Feature2) as well as the CurrentPrice. I just add this data to the batch table with TargetActualPrice set to NULL and TargetDateTime set to 2 hours in the future. I also update the TargetActualPrice in the previous row(s) in the batch table (those with TargetDateTime equal to the current time). Now I can run the model against this batch table and get a prediction for a maximum of 4 rows with TargetActualPrice = NULL.
For clarity, in the batch table I end up with 4 rows where TargetActualPrice = NULL, matching the prediction horizon of the model. When I run the prediction, I will get 4 prediction values for these NULL rows.
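The bookkeeping described above can be sketched in plain Python. This only illustrates how the batch table evolves; it does not call any Vertex API. The helper add_observation, the example date and the placeholder feature values are assumptions made for illustration.

```python
from datetime import datetime, timedelta

HORIZON = timedelta(hours=2)  # 4 steps of 30 minutes, as in the question


def add_observation(batch, now, price, feature1, feature2):
    """Fill in actuals that have just become known, then append the newest row."""
    for row in batch:
        # The price observed now is the actual for rows that targeted this time.
        if row["TargetDateTime"] == now and row["TargetActualPrice"] is None:
            row["TargetActualPrice"] = price
    batch.append({
        "CurrentDatetime": now,
        "CurrentPrice": price,
        "Feature1": feature1,
        "Feature2": feature2,
        "TargetDateTime": now + HORIZON,
        "TargetActualPrice": None,  # unknown until two hours from now
        "TimeSeriesId": 1,
    })
    return batch


batch = []
start = datetime(2024, 1, 1, 11, 0)  # arbitrary example date
for step in range(8):  # 11:00 through 14:30
    now = start + step * timedelta(minutes=30)
    add_observation(batch, now, 2.1 + 0.1 * step, 3.4, "abc")

pending = [row for row in batch if row["TargetActualPrice"] is None]
print(len(pending))  # 4
```

After the 14:30 update, the four oldest rows have actuals and the four newest are still NULL, which matches the example table and the forecast horizon of 4.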

Calculate the InfluxDB average

I want to process values from InfluxDB in Grafana.
The end goal is to show how many miles the current vehicle has traveled in a certain time frame.
You can use the formula: average velocity * time.
Does anyone more experienced have a good method for this?
So what I'm thinking is: I've got the mean function for the average speed over a fixed period of time and the corresponding mileage, and then I want to add all the mileage together. How do I do that?
What if you only use SQL?
1.) InfluxDB uses InfluxQL, not SQL.
2.) Your approach of average velocity * time is inaccurate.
3.) Use suitable InfluxDB functions. I would say INTEGRAL() is the best function for this case, plus some basic arithmetic. Don't expect 100% accuracy. Accuracy depends heavily on the metric sampling, e.g. with 1 minute sampling, the vehicle may drive for 59 seconds and stand still only for the one second when the sample is taken. So don't be surprised if even 10 second sampling is inaccurate.
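To illustrate why integrating the speed samples can differ from average velocity * time, here is a small Python sketch that computes the area under the speed curve with the trapezoidal rule. It is not InfluxQL and makes no claim about how INTEGRAL() is implemented internally; the function names and the sample data (seconds, metres per second) are made up for the example.

```python
def distance_trapezoid(times_s, speeds_mps):
    """Area under the speed curve: sum of trapezoids between consecutive samples."""
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total += 0.5 * (speeds_mps[i] + speeds_mps[i - 1]) * dt
    return total


def distance_mean_speed(times_s, speeds_mps):
    """Naive estimate: unweighted mean of the samples times the elapsed time."""
    mean_speed = sum(speeds_mps) / len(speeds_mps)
    return mean_speed * (times_s[-1] - times_s[0])


# Irregular sampling: the vehicle stops after 20 s and is sampled again at 80 s.
times = [0, 10, 20, 80]
speeds = [10, 10, 0, 0]
print(distance_trapezoid(times, speeds))   # 150.0
print(distance_mean_speed(times, speeds))  # 400.0
```

The unweighted mean treats every sample as covering the same amount of time, so it overestimates here; the integral weights each sample by the time it spans. Neither can recover movement that happened entirely between two samples.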

Prometheus query for last local peak value

What Prometheus query (PromQl) can be used to identify the last local peak value in the last X minutes in a graph?
A local peak is a point that is larger than both its previous and its next datapoint. (So the current time is definitely not a local peak.)
(p: peak point, i: cronjob interval, m: missed execution)
I want this value to detect anomalies in the execution of a cron job. As you can see in the picture, I have written a query that calculates the time elapsed since the last execution of a job. To write an alert rule that detects missed executions, I need to know how long the job's most recent interval was, so that I can compare the elapsed time against it. The query does not know this interval (it is set by another program), so I cannot compare the elapsed time with a fixed duration.
Use the z-score to detect anomalies
If you know the average value and standard deviation (σ) of a series, you can calculate the z-score of any sample in the series. The z-score is measured in the number of standard deviations from the mean. So a z-score of 0 means the value is identical to the mean in a data set with a normal distribution, while a z-score of 1 is 1.0 σ from the mean, etc.
Calculate the average and standard deviation for the metric using data with a large sample size.
# Long-term average value for the series
- record: job:cronjob_duration_time_seconds_count:rate10m:avg_over_time_1w
  expr: avg_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])
# Long-term standard deviation for the series
- record: job:cronjob_duration_time_seconds_count:rate10m:stddev_over_time_1w
  expr: stddev_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])
Calculate the z-score for the Prometheus query once you have the average and standard deviation for the aggregation.
# Z-Score for aggregation
(
  job:cronjob_duration_time_seconds_count:rate10m -
  job:cronjob_duration_time_seconds_count:rate10m:avg_over_time_1w
) / stddev_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])
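Based on the statistical principles of normal distributions, you can assume that any value that falls outside of the range of roughly +3 to -3 is an anomaly, since about 99.7% of normally distributed values lie within three standard deviations of the mean (a range of +1 to -1 would flag roughly a third of all samples). For example, you can get an alert when the aggregation is out of this range for more than five minutes.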
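As an offline illustration of the same arithmetic, here is a minimal Python sketch that computes z-scores against a long-term history. It uses the population standard deviation from the standard library as an assumption; the function name z_score, the sample numbers and the cutoff variable are made up for the example and are not part of any Prometheus API.

```python
from statistics import mean, pstdev


def z_score(history, value):
    """Number of standard deviations between value and the mean of history."""
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0  # assumption: a constant history yields no meaningful score
    return (value - mean(history)) / sigma


history = [8, 10, 12, 10]  # long-term samples of the aggregated rate
cutoff = 3.0               # assumed cutoff; use whatever range the alert should enforce

for sample in (10, 20):
    z = z_score(history, sample)
    print(sample, round(z, 2), abs(z) > cutoff)
```

A sample equal to the mean scores 0, while the sample of 20 scores about 7 and would be flagged under any reasonable cutoff.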
If what you want is an alert to be fired when the elapsed time has been longer than a fixed duration, you can set an alert similar to the up alert, based on the changes > 0 expression, which is only true (i.e. > 0) when the job is running.
An example would be:
rules:
  - alert: CronJobNotRunning
    expr: |
      changes(
        sum(
          rate(
            cronjob_duration_time_seconds_count{
              status="ok", namespace="<namespace>", exported_job="<job>"
            }[1m]
          )
        )[1m:]
      ) == 0
    for: <alert_duration>
Note that subqueries ([1m:]) are expensive, and introducing a recording rule there can help performance, especially in a dashboard.
Also, in your case, the time since the last time the second derivative was non-zero can be used too, as that happens when a job starts/finishes (the drops in the graph, or when it starts to rise).
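To make the local-peak definition from the question concrete, here is a small Python sketch that scans a list of samples backwards and returns the most recent point that is larger than both neighbours. This is an offline illustration, not a PromQL query; the function name last_local_peak and the sawtooth data are hypothetical.

```python
def last_local_peak(values):
    """Return (index, value) of the most recent local peak, or None if there is none."""
    # The first and last samples cannot be peaks: each lacks one neighbour.
    for i in range(len(values) - 2, 0, -1):
        if values[i] > values[i - 1] and values[i] > values[i + 1]:
            return i, values[i]
    return None


# Elapsed seconds since the last run: it drops to 0 whenever the job executes.
elapsed = [0, 30, 60, 0, 30, 60, 90, 120, 0, 30]
print(last_local_peak(elapsed))  # (7, 120)
```

In this made-up series the job normally runs every 60 seconds, and the last peak of 120 corresponds to an interval in which one execution was missed.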

Forecasts Machine Learning

This is a follow-up to my other question. I'm building a machine learning model to forecast when certain things will happen. I will use softmax as the output.
My question is: is it better to use 7 output nodes (ranging from Sunday to Saturday, i.e. for data on Monday, the model predicts that something will happen on Friday) or 0...n output nodes (the interval in days since day h)?
If the weekday has nothing to do with your data, it's definitely better to use the 0...n output nodes that count days since the reference day.
In that case, which differs from what you asked last time, a single neuron with relu as output might be even better. (This time the weekday does not seem to play a role, so you are not trying to classify the weekday (classification, discrete) but want to know the time to the next event (regression, continuous), which could also be 3.54 days.)
Classification: Softmax
Regression: Single neuron with relu/linear/...
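The two target encodings being compared can be sketched as follows, assuming a Sunday = 0 to Saturday = 6 class index for the softmax variant and a plain day count for the regression variant. The function names and dates are made up for illustration.

```python
from datetime import date


def weekday_class(event_day):
    """Class index for a 7-way softmax target, with Sunday = 0 and Saturday = 6."""
    # date.weekday() uses Monday = 0, so shift by one to start the week on Sunday.
    return (event_day.weekday() + 1) % 7


def one_hot(index, size):
    """Softmax training target: 1.0 at the class index, 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def days_until(reference_day, event_day):
    """Regression target: number of days from the reference day to the event."""
    return (event_day - reference_day).days


monday = date(2024, 1, 1)  # data observed on a Monday
friday = date(2024, 1, 5)  # the event happens that Friday

print(one_hot(weekday_class(friday), 7))  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
print(days_until(monday, friday))         # 4
```

The weekday target repeats every week and carries no information about how far away the event is, whereas the day count does, which is why the answer prefers it when the weekday itself is irrelevant.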

Time series normalization, how to handle zeros

I'm working on a player churn prediction model for a game. For each player I have a 60-day time series of the number of rounds played. Before I feed the time series to classification algorithms, I need to normalize it.
I was thinking about using min-max normalization by transforming x to x/Max(x). However, Max(x) in the 60-day time series doesn't necessarily capture the peak of how many times a player usually plays in a day.
z-normalization, which transforms x to (x-mean(x))/std(x), will not work either, since I need to preserve the information that days with no play are zero. z-normalization maps 0 to a different value for each series, which makes them incomparable.
Is there a normalization scheme which requires no information about the maximum of the time series and still maps 0 to 0?
You can transform your values into probabilities by dividing each value in the array by the sum of the values in the array (normalization factor "sum to unity"), i.e. transform x to x./sum(x).
That maps 0 values to 0 and requires no information about the maximum value.
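A minimal Python sketch of this sum-to-unity normalization follows. Handling a player who never played (a sum of zero) by returning all zeros is an assumption added here; the answer above does not address that case.

```python
def sum_to_unity(series):
    """Divide every value by the series total so the result sums to 1."""
    total = sum(series)
    if total == 0:
        # Assumption: a series with no play at all stays all zeros.
        return [0.0] * len(series)
    return [x / total for x in series]


rounds_per_day = [0, 4, 0, 6, 10]
print(sum_to_unity(rounds_per_day))  # [0.0, 0.2, 0.0, 0.3, 0.5]
```

Days without play remain exactly 0, and each remaining value is that day's share of the player's total activity.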
