I have two datasets that record the temperature received from a sensor and the time (down to the second) each reading is received. I am evaluating using an LSTM as one option to predict future temperatures. The intervals between records are very irregular.
The first dataset has the following characteristics:
Time period: 1 day. Number of samples: 2026. Max interval in seconds: 4377. Min interval in seconds: 0. Average interval in seconds: 41.17. 93% of intervals are under one minute, 85% are under 30 seconds and 80% are under 20 seconds.
The second dataset has the following characteristics:
Time period: 3 days. Number of samples: 1383. Max interval in seconds: 23976. Min interval in seconds: 0. Average interval in seconds: 184.93. 89% of intervals are under one minute, 79% are under 30 seconds and 74% are under 20 seconds.
Given the irregular nature of the intervals, what is the best way to prepare the data as input for an LSTM model? Leave it as is? Take the average temperature over fixed time intervals? Remove extreme values? Other options?
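For context, a rough sketch of two of those options (assuming a pandas DataFrame with hypothetical timestamp and temperature columns; illustrative only, not a recommendation):

import pandas as pd

# Hypothetical frame standing in for the sensor log: one timestamp and one
# temperature reading per row, at irregular intervals.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00:03", "2024-01-01 00:00:19",
        "2024-01-01 00:00:47", "2024-01-01 00:05:02",
    ]),
    "temperature": [21.3, 21.4, 21.4, 21.9],
}).set_index("timestamp").sort_index()

# Option A: resample onto a fixed grid (e.g. 30 s), averaging readings that
# fall into the same bucket, and interpolate the gaps.
regular = df["temperature"].resample("30s").mean().interpolate(method="time")

# Option B: keep the raw, irregular samples and give the LSTM the elapsed
# time since the previous reading as an extra feature.
raw = df.reset_index()
raw["delta_t"] = raw["timestamp"].diff().dt.total_seconds().fillna(0)
features = raw[["temperature", "delta_t"]].to_numpy()  # shape (n_samples, 2)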
GOAL
Every 30 mins I get a new batch of price-related data:
CurrentDatetime, CurrentPrice, Feature1, Feature2
I want to predict the price in 2 hours from now, so 4x30mins (4 steps into the future)
PROBLEM DESCRIPTION
I am puzzled about what Google Vertex AI AutoML forecasting is doing and whether I can trust the results I am getting. I am also unsure how to use a trained model for batch predictions.
WHAT I DID
I think the way to set up the training dataset is to add:
TargetDateTime column (2 hours ahead of CurrentDatetime)
TargetActualPrice column (the actual price 2 hours into the future)
TimeSeriesId column (always equal to 1, as all the data is one time series).
This means, every 30 mins I now have:
CurrentDatetime, CurrentPrice, Feature1, Feature2, TargetDateTime, TargetActualPrice, TimeSeriesId
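For reference, a minimal pandas sketch of how such columns could be derived from the raw 30-minute rows (the shift-by-4 and the starting DataFrame are my own preprocessing assumptions, nothing Vertex-specific):

import pandas as pd

# df holds one row per 30-minute bar with CurrentDatetime, CurrentPrice,
# Feature1 and Feature2, sorted by time.
df = df.sort_values("CurrentDatetime").reset_index(drop=True)

# Target is the price 4 steps (2 hours) ahead.
df["TargetDateTime"] = df["CurrentDatetime"] + pd.Timedelta(hours=2)
df["TargetActualPrice"] = df["CurrentPrice"].shift(-4)

# Single time series, so a constant identifier.
df["TimeSeriesId"] = 1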
I use this dataset to train an auto-ml forecast model, setting:
"series Identifier Column" to TimeSeriesId,
"Target Column" to TargetActualPrice,
"Timestamp Column" to TargetDateTime
"Data granularity" to 30mins
"Forecast Horizon" to 4
"Context Window" to 4 (use last 2 hours of historic data to predict next 2 hours)
Split train/val/test chronologically (on TargetDateTime, as it is the Timestamp column)
This model trains and gives some results.
Looking at the saved test dataset, I can see 4 rows for each TargetDateTime, with a predicted value column containing a price prediction and a predicted_on_TargetDateTime column which goes from CurrentDatetime to TargetDateTime in 30-minute intervals.
This makes sense: for every 30 mins of input data, the model makes 4 predictions, each a further 30 mins into the future, ending up with a prediction 2 hours ahead. Great.
PROBLEM 1: Batch predictions
I get confused when I try to use this trained model to make batch predictions. The crux of the problem is that Vertex will look at the batch input dataset, find the first row (of 30-minute input data) for which there is no actual price yet (TargetActualPrice is null), and then predict the next 4 steps (2 hours). This seems to mean that, to make the next prediction, I would need to wait for the actuals of the previous prediction. But then, when I get the next set of input data (30 mins later, and 1.5 hours before the previous prediction's target time), I cannot use the model to make a new prediction because the previous prediction has no TargetActualPrice yet.
To make it more explicit, suppose I have the following batch data:
CurrentDatetime  CurrentPrice  Feature1  Feature2  TargetDateTime  TargetActualPrice  TimeSeriesId
11:00            $2.1          3.4       abc       13:00           $2.4               1
11:30            $2.2          3.3       abd       13:30           $2.5               1
12:00            $2.3          3.1       abe       14:00           $2.6               1
12:30            $2.3          3.0       abe       14:30           $2.7               1
13:00            $2.4          2.9       abf       15:00           null               1
13:30            $2.5          2.8       abg       15:30           null               1
14:00            $2.6          2.7       abh       16:00           null               1
14:30            $2.7          2.6       abi       16:30           null               1
In the batch data above, I have 2 hours (4 rows) of historic data with actuals (11:00-12:30). The current time is 14:30, so I don't have actuals for 15:00 yet. The last prediction was made with the 13:00 input data (as it is the first row with a null actual). The 13:30-14:30 rows cannot be used for a new prediction until I have the 15:00 actuals.
This doesn't make sense to me. I should be able to make a new 2-hour (4-step) prediction every 30 mins? I must be doing something wrong?
Is the solution that, when I get the next 30 mins of input data, I should put the last predicted value into the actuals column (and update it with the real actual once I have it) in order to proceed with the next prediction? That seems cumbersome.
PROBLEM 2: Leakage
My other concern is how Vertex is training and calculating the results. I am worried that, during training, when Vertex picks up the next row of 30-minute data, it will create a prediction based on the previous 4x30 mins of data (the 2-hour "Context Window") INCLUDING the TargetActualPrice values for those rows. But this would be incorrect, as the TargetActualPrice value is 2 hours in the future and not yet available when the next 30 mins of data comes in. This would mean leakage of actual data: predicting using actuals before they are known (i.e. cheating).
SUMMARY
In summary, I am hoping someone can tell me whether I am setting the dataset up incorrectly, and/or how to batch predict every 30 mins.
With regards to my leakage concern, it originally came about because I didn't understand the batch predictions: it seemed that every 30 mins I needed the price from 2 hours in the future in order to update the actuals of the previous prediction and then create a new prediction, and that price is obviously unknown at that moment.
Now my understanding is that during training, even though every 30-minute timestep has an actual price column for 2 hours in the future, the model will only use actual prices available at the moment of prediction. If making a prediction at 14:30, the model will use the 14:30 data (and historic data before 14:30), and then make 4x30-minute predictions. These four predictions do not use and are not influenced by the 30-minute data after 14:30 (15:00, 15:30, 16:00, 16:30). Nor do they use the 16:30 actual price value, which is the target column on the 14:30 row of data.
I think I now understand the batch predictions.
Every 30 mins I get a new set of data (Feature1 and Feature2) as well as the CurrentPrice. I just add this data to the batch table with TargetActualPrice set to NULL and TargetDateTime set to 2 hours in the future. I also update the TargetActualPrice in the previous row(s) of the batch table (those with TargetDateTime equal to the current time). Now I can run the model against this batch table and get predictions for at most 4 rows with TargetActualPrice = NULL.
For clarity, in the batch table I end up with 4 TargetActualPrice = NULL rows, matching the prediction horizon of the model. When I run the prediction, I will get 4 prediction values for these NULL rows.
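A minimal sketch of that 30-minute update step (a hypothetical pandas helper; column names as in the batch table above):

import pandas as pd

def update_batch_table(batch, now, price, feat1, feat2):
    # Backfill: rows whose TargetDateTime has now arrived get the actual price.
    due = (batch["TargetDateTime"] == now) & batch["TargetActualPrice"].isna()
    batch.loc[due, "TargetActualPrice"] = price

    # Append the new 30-minute input row; its 2-hours-ahead target is unknown.
    new_row = {
        "CurrentDatetime": now, "CurrentPrice": price,
        "Feature1": feat1, "Feature2": feat2,
        "TargetDateTime": now + pd.Timedelta(hours=2),
        "TargetActualPrice": None, "TimeSeriesId": 1,
    }
    return pd.concat([batch, pd.DataFrame([new_row])], ignore_index=True)

# Example: the 14:30 row arrives; the 12:30 row (target 14:30) gets its actual.
batch = pd.DataFrame({
    "CurrentDatetime": [pd.Timestamp("2024-01-01 12:30")],
    "CurrentPrice": [2.3], "Feature1": [3.0], "Feature2": ["abe"],
    "TargetDateTime": [pd.Timestamp("2024-01-01 14:30")],
    "TargetActualPrice": [None], "TimeSeriesId": [1],
})
batch = update_batch_table(batch, pd.Timestamp("2024-01-01 14:30"), 2.7, 2.6, "abi")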
I have the following dataset, with the data at 15-minute intervals:
Time A B A+B
2021-01-01 00:00 10 20 30
2021-01-01 00:15 20 30 50
2021-01-01 00:30 30 40 70
2021-01-01 01:00 40 50 90
2021-01-01 01:00 10 20 30
2021-01-01 01:15 20 30 50
2021-01-01 01:30 30 40 70
2021-01-01 02:00 40 50 90
Basically, I need to develop a machine learning model for predicting the hourly A+B:
Time A+B
2021-01-02 00:00
2021-01-02 01:00
2021-01-02 02:00
2021-01-02 03:00
I want to ask about selecting the target label for my training model:
Should I train on the 15-minute data and aggregate the results into hourly A+B afterwards, or should I aggregate the 15-minute data into hourly data before training? What is the difference?
Is there any difference between training A and B separately and adding them up, compared with training on A+B directly?
Thanks a lot
Here is a possible solution. Since you care about the hourly total and you get data every 15 mins, I would give the 15-minute interval data for a whole hour as the input to the network. The output would then be the final value at the end of that hour.
So, for example, the input to the net would have shape [4, 2]: the A and B values for each of the four 15-minute steps. The output would be the final result after the hour.
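For illustration, a minimal sketch of that shape (using Keras with an LSTM layer is my own assumption; the answer only specifies the [4, 2] input and the end-of-hour output, and the data here is dummy):

import numpy as np
from tensorflow import keras

# Dummy stand-in data: 100 samples, each one hour of 15-minute readings,
# i.e. shape (100, 4, 2) -> 4 time steps, 2 features (A and B).
X = np.random.rand(100, 4, 2).astype("float32")
y = X[:, -1, :].sum(axis=1, keepdims=True)  # dummy target: A+B at the last step

model = keras.Sequential([
    keras.layers.Input(shape=(4, 2)),
    keras.layers.LSTM(16),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)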
On another note, this doesn't sound like a problem that needs machine learning, but I'm sure there is more information I don't know about.
I would first split the data into a training and validation set as-is.
Then take a third option: use a sliding window of 1 hour over the samples in each set to produce data at one-hour intervals. This will create about 3x more valid training samples than simply aggregating.
Whether to build a model of A, B, or A+B depends on what you want to predict. Do you need predictions for A and B separately, or do you only need A+B? If you only want A+B, then build the model around that. Any basic ML model will be able to handle the summation, so it will likely not make a significant difference. As with most data-driven problems, it will depend on the data, so if you really want to find out whether there is a difference for your data, you may want to try both and compare results on a hold-out set.
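A rough sketch of that sliding-window step (NumPy, with dummy stand-in arrays; the array names are assumptions):

import numpy as np

# Dummy stand-in for the 15-minute readings, in time order:
# values[:, 0] = A, values[:, 1] = B; target = the A+B series.
rng = np.random.default_rng(0)
values = rng.random((96, 2))     # one day of 15-minute A/B readings
target = values.sum(axis=1)      # the A+B series to predict

window = 4  # one hour = four 15-minute steps

X, y = [], []
for start in range(len(values) - window):
    X.append(values[start:start + window])   # one hour of inputs
    y.append(target[start + window])         # value right after that hour
X = np.asarray(X)   # shape (n - window, 4, 2)
y = np.asarray(y)

# In practice, split train/validation chronologically before windowing,
# so no window straddles the boundary.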
I want to process values from InfluxDB in Grafana.
The final requirement is to show how many miles the current vehicle has traveled in a certain time frame.
You can use the formula: average velocity * time.
Does anyone with experience have a good method?
So what I'm thinking is: I can use the mean function to get the average speed over a fixed period of time and the corresponding mileage, and then add all the mileage values together. How do I do that?
What if you only use SQL?
1.) InfluxDB uses InfluxQL, not SQL.
2.) Your approach of average velocity * time is inaccurate.
3.) Use suitable InfluxDB functions; I would say INTEGRAL() is the best function for this case, plus some basic arithmetic. Don't expect 100% accuracy. Accuracy depends heavily on the metric sampling, e.g. with 1-minute sampling, what if the vehicle is driving for 59 seconds and is not moving during the one second when the sample is taken? So don't be surprised when even 10-second sampling is inaccurate.
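For example, calling it from Python with the InfluxDB 1.x client could look roughly like this (the measurement and field names are assumptions):

from influxdb import InfluxDBClient  # InfluxDB 1.x client: pip install influxdb

client = InfluxDBClient(host="localhost", port=8086, database="telemetry")

# Hypothetical schema: a "vehicle" measurement with a "speed_mph" field.
# INTEGRAL(field, 1h) integrates the field over time, so speed in mph
# integrated with a 1-hour unit yields miles.
query = """
    SELECT INTEGRAL("speed_mph", 1h) AS miles
    FROM "vehicle"
    WHERE time >= now() - 24h
    GROUP BY time(1h)
"""
print(list(client.query(query).get_points()))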
What Prometheus query (PromQL) can be used to identify the last local peak value in the last X minutes in a graph?
A local peak is a point that is larger than its previous and next datapoint. (So the current time is definitely not a local peak)
(p: peak point, i: cron job interval, m: missed execution)
I want this value to find anomalies in the execution of a cron job. As you can see in the picture, I have written a query to calculate the elapsed time since the last execution of a job. Now, to set an alert rule that calculates the elapsed time since the last successful execution and finds missed executions, I need the time at which the last execution of the job occurred within that interval. This interval is unknown to the query (in other words, the interval of the job is specified by another program), so I cannot compare the elapsed time with a fixed value.
Use the z-score to detect anomalies.
If you know the average value and standard deviation (σ) of a series, you can use any sample in the series to calculate its z-score. The z-score is measured in the number of standard deviations from the mean, so a z-score of 0 means the sample is identical to the mean (in a data set with a normal distribution), while a z-score of 1 is 1.0 σ from the mean, and so on.
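In formula form, for a sample $x$ of a series with mean $\mu$ and standard deviation $\sigma$:

$$ z = \frac{x - \mu}{\sigma} $$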
Calculate the average and standard deviation for the metric using data with large sample size.
# Long-term average value for the series
- record: job:cronjob_duration_time_seconds_count:rate10m:avg_over_time_1w
  expr: avg_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])
# Long-term standard deviation for the series
- record: job:cronjob_duration_time_seconds_count:rate10m:stddev_over_time_1w
  expr: stddev_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])
Then calculate the z-score for the Prometheus query once you have the average and standard deviation for the aggregation.
# Z-score for the aggregation
# (assumes job:cronjob_duration_time_seconds_count:rate10m records
#  sum(rate(cronjob_duration_time_seconds_count[10m])))
(
  job:cronjob_duration_time_seconds_count:rate10m -
  job:cronjob_duration_time_seconds_count:rate10m:avg_over_time_1w
) /
job:cronjob_duration_time_seconds_count:rate10m:stddev_over_time_1w
Based on the statistical principles of normal distributions, you can assume that any value that falls outside of the range of roughly +1 to -1 is an anomaly. For example, you can fire an alert when the aggregation is out of this range for more than five minutes.
If what you want is an alert to be fired when the elapsed time has been longer than a fixed duration, you can set up an alert similar to the up alert, based on a changes(...) > 0 expression, which is only true (i.e. > 0) while the job is running.
An example would be:
rules:
- alert: CronJobNotRunning
  expr: |
    changes(
      sum(
        rate(
          cronjob_duration_time_seconds_count{
            status="ok", namespace="<namespace>", exported_job="<job>"
          }[1m]
        )
      )[1m:]
    ) == 0
  for: <alert_duration>
Note that subqueries ([1m:]) are expensive, and introducing a recording rule there can help performance, especially in a dashboard.
Also, in your case, the time since the last time the second derivative was non-zero can be used too, as that happens when a job starts/finishes (the drops in the graph, or when it starts to rise).
I'm a little bit confused about how to format my model for variable-time-interval event classification with an LSTM.
I'm trying to classify events in the time dimension. The events occur at different intervals (e.g., Event1 and Event2 have an interval of 3 seconds, while Event2 and Event3 have an interval of 1.68 seconds). Each event has the same number of features describing its characteristics.
The confusing part is, the intervals between these events are also critical for the classification problem. For example:
1) if Event1 and Event2 both have feature x = 0, and the interval is less than 1 second, these series of events would be classified as class 1;
2) if Event1, Event2 and Event3 all have feature x = 1, and the intervals are longer than 2 seconds, these series of events would be classified as class 2.
I have searched many questions related to RNN time series problems and found nothing helpful. Does anyone have good suggestions on formatting my data and model to deal with this problem?
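For what it's worth, a minimal sketch of one common way to encode the intervals: append the elapsed time since the previous event as an extra per-step feature (the helper and numbers below are purely illustrative, not a confirmed solution):

import numpy as np

# One common way to let an LSTM "see" irregular spacing: append the elapsed
# time since the previous event as an extra feature on every time step.
def build_sequence(event_features, timestamps):
    """event_features: (n_events, k); timestamps: (n_events,) in seconds."""
    deltas = np.diff(timestamps, prepend=timestamps[0])                # (n_events,)
    return np.concatenate([event_features, deltas[:, None]], axis=1)  # (n_events, k + 1)

# Illustrative numbers only: 3 events, 2 features each,
# with intervals of 3.0 s and 1.68 s between them.
feats = np.array([[0.0, 1.0], [0.0, 0.5], [1.0, 0.2]])
times = np.array([0.0, 3.0, 4.68])
seq = build_sequence(feats, times)  # shape (3, 3); last column = [0.0, 3.0, 1.68]
# Sequences like `seq` (padded/batched as needed) can then feed an LSTM classifier.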