Google's AutoML time series forecasting: understanding data exported to BigQuery during training

When training a time series forecasting model, I checked the option to "Export test dataset to BigQuery." I'm having a hard time understanding the meaning of the "predicted_on" timestamps that appear in the BigQuery table.
Some info about my model: the granularity is weekly. The context window is 26 weeks, and the forecast horizon is 26 weeks. The 10% test data split also contains exactly 26 weeks of data. In our training data, we have a submission_week column which is designated as the "timestamp" column.
In the BigQuery table, I see the submission_week column. It starts on 06/05/2022, which is the first date of the 10% test data split.
The BigQuery table also contains a predicted_on_submission_week column. (This is the column which I don't understand.)
When I sort the BigQuery table by submission_week and then predicted_on_submission_week, it looks like this:†
predicted_on_submission_week    submission_week
06/05/2022                      06/05/2022
---
06/05/2022                      06/12/2022
06/12/2022                      06/12/2022
---
06/05/2022                      06/19/2022
06/12/2022                      06/19/2022
06/19/2022                      06/19/2022
† (Note that for each row above, there are actually multiple rows in the BigQuery table - one for each time series.)
The pattern seen above continues until there are at most 6 predicted_on_submission_week timestamps for every submission_week timestamp.
My questions:
What is the meaning of the predicted_on_submission_week timestamps? Why are there multiple (at most 6) such timestamps for each submission_week timestamp?
(I suspect this may be related to how the context window and forecast horizon are used during training and forecasts as described here in Google's documentation, but I'm not sure...)
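For reference, a minimal sketch of how the counts can be checked (the table name below is a placeholder; the actual exported table's name will differ):

# Count how many prediction origins exist per submission_week in the exported table.
# The project/dataset/table names are placeholders for the real export target.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  submission_week,
  COUNT(DISTINCT predicted_on_submission_week) AS prediction_origins
FROM `my-project.my_dataset.exported_test_data`
GROUP BY submission_week
ORDER BY submission_week
"""

for row in client.query(sql).result():
    # Based on the pattern above, this count should grow 1, 2, 3, ...
    # and level off at 6 origins per submission_week.
    print(row.submission_week, row.prediction_origins)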

Related

How to calculate change rate over time with a tabular dataset in Google Data Studio?

I seek your valuable support in finding a way to calculate change rate over time with a tabular dataset in Google Data Studio. Here is the link to the dataset: https://docs.google.com/spreadsheets/d/1To1n5JJA6uVkLMgwjKhghJgCJpFmtXkqNog4DzfoEbE/edit?usp=sharing
There are many rows with a date stamp and different categories and sub-categories. I have manually created a change rate table, based on which I want to create charts in Google Data Studio. The charts will be built from the raw tabular data, not from the separate change rate table, which was built only for example purposes.
So the chart could be based on a main category (as in the sample), could also be viewed by sub-category, and should show the change rate over time between the dates.
The dates can sometimes be months or years. I am not very savvy with advanced formulas or scripting, but I am hopeful someone here would be able to help me out on this. I will be ever so grateful for this :)
I can only provide you with the quotient between two days' data. If you need different mappings between dates (days, months, years), the following steps have to be done for each:
generate a new field "yesterday" with: DATETIME_SUB(date, INTERVAL 1 DAY)
blend this dataset with itself, using "date" and "yesterday" as the blend dimensions.
Further dimensions are your category fields A and B.
As the metric, you can use the count of the date field.
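For reference, here is a rough pandas equivalent of the same idea (outside Data Studio); the column names date, category_a, and value are placeholders for the sheet's actual columns:

import pandas as pd

# Toy data standing in for the spreadsheet rows.
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"] * 2),
    "category_a": ["X", "X", "X", "Y", "Y", "Y"],
    "value": [100, 110, 99, 50, 55, 66],
})

# Step 1: the "yesterday" field, i.e. DATETIME_SUB(date, INTERVAL 1 DAY).
df["yesterday"] = df["date"] - pd.Timedelta(days=1)

# Step 2: blend the dataset with itself, matching today's "yesterday"
# to the other copy's "date" within the same category.
prev = df[["date", "category_a", "value"]].rename(
    columns={"date": "prev_date", "category_a": "prev_category", "value": "prev_value"}
)
merged = df.merge(
    prev,
    left_on=["yesterday", "category_a"],
    right_on=["prev_date", "prev_category"],
    how="left",
)

# Step 3: the change rate between consecutive dates.
merged["change_rate"] = merged["value"] / merged["prev_value"] - 1
print(merged[["date", "category_a", "value", "prev_value", "change_rate"]])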

Machine learning model with varying input shape as time changes

I am trying to predict the bookings of a stand-up comedian cafe. There are a lot of features I can use which have an effect on the number of sales (e.g. day of the year, weather, average sales last month, day of the week, average sales on the specific day of the week, etc.).
However, one of the features that correlates most with the actual number of sales is the number of tickets already sold before the deadline. The customers are able to start making reservations 120 hours (5 days) before the actual deadline of ordering (11:00 AM on the same day of the show).
I would prefer to use this data as input for my machine learning algorithm. Currently I have created 120 columns in the dataframe. The columns cover the 120 hours before the deadline up until the deadline itself. Column "hour_98" therefore shows the accumulated sales 4 days before the deadline, column "hour_24" shows the accumulated sales 24 hours before the deadline, etc.
If I now would like to predict the sales 24 hours before the deadline, the columns "hour_24" until "hour_0" all have "NaN" values. Since algorithms can't deal with NaN values, I currently give these columns a value of 0. However, I think this is too simplistic and will result in a bad prediction model.
How do we deal with a changing input shape since we obtain more data if we get closer to the deadline of ordering?
Now from what I understand, you have a fixed number of columns, each representing the data from a predefined hour before the deadline. So in a sense the input data shape never changes; only the validity of some input features changes.
Provided you have a fixed input shape with changing validity of the features (NaNs), you can get around that issue by using a mask for each input feature. For example, a valid hour_24 can be represented as hour_24 = 20 and mask_24 = 1, and an invalid hour_24 can be represented as hour_24 = 0 (or whatever) and mask_24 = 0.
The algorithm itself will need to learn when to ignore a given feature with respect to the related feature's mask.
This answer explains in more detail how to mask input.
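A minimal sketch of that masking idea with pandas/NumPy, using a made-up toy frame that follows the hour_X naming from the question:

import numpy as np
import pandas as pd

# Toy frame: each row is one show, columns are cumulative sales N hours
# before the deadline; NaN means "not observed yet" at prediction time.
df = pd.DataFrame({
    "hour_48": [5.0, 12.0, 3.0],
    "hour_24": [9.0, np.nan, 7.0],
    "hour_0": [np.nan, np.nan, 15.0],
})

hour_cols = ["hour_48", "hour_24", "hour_0"]
for col in hour_cols:
    # mask_N = 1 when the matching hour_N value is valid, 0 when it is not.
    df["mask_" + col.split("_")[1]] = df[col].notna().astype(int)

# Replace the NaNs with a neutral placeholder; the paired mask lets the
# model distinguish a real 0 in sales from "no data yet".
df[hour_cols] = df[hour_cols].fillna(0)

print(df)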

Should PAX be in the Flight dimension or the Fact Sales table?

I need to build a data mart using Power Pivot for a duty-free shop at an airport.
The sales manager is analyzing sales data by flight number and by PAX, the number of people per flight.
So, I don't know where to put PAX: in DimFlight or FactSales. It is additive, right?
Please explain why and into which table I should put PAX. DimFlight may include airline, flignt_no, date, PAX. A flight may also land at the airport more than once a day.
PAX is a fact describing a measurable value of a specific flight event. It should be in the fact table, not in the flight dimension. I would expect total capacity to be an attribute of the plane dimension associated with the flight event. (Flight number would likely be a degenerate dimension as it doesn't really own any attributes.) However, the PAX itself should be a measure in the fact table.
You can generate a junk dimension that has the banding mentioned by @Luis Leal to do some capacity analytics. You can even create a numbers dimension with an attribute for each group level so you can do more detailed banding. For example, an attribute for 1s, 10s, 100s, 1000s, etc. You can also calculate the filled capacity of the flight and point to the numbers dimension so you can group flights by 80% full, 90% full, etc.
Nothing stops you from modeling it as both dimension and measure, so you can store it both on a dimension table and as a measure on a fact table. If you store it as a measure on the fact table, you can perform several analyses by the other possible dimensions and get insights such as averages, max, min, and totals by x or y dimension, which would be very difficult if you stored it only on the dimension table.
On the other hand, storing it in the dimension table enables additional "perspectives" of analysis; for example, a common approach is to store in the dimension table "interval" columns with values like: from 1 to 1000 pax, from 1001 to 2000. This column is calculated at ETL time depending on the value of PAX. So why not use both?
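As a rough illustration of that banding (normally computed at ETL time), a pandas sketch with invented flight numbers, PAX counts, and capacities could look like this:

import pandas as pd

flights = pd.DataFrame({
    "flight_no": ["AB123", "AB124", "CD456"],
    "pax": [150, 90, 410],
    "plane_capacity": [180, 180, 420],
})

# Interval band for the raw PAX value, as described for the junk dimension.
flights["pax_band"] = pd.cut(
    flights["pax"],
    bins=[0, 1000, 2000],
    labels=["1-1000", "1001-2000"],
)

# Filled-capacity band, so flights can be grouped by how full they were.
flights["fill_pct_band"] = pd.cut(
    flights["pax"] / flights["plane_capacity"] * 100,
    bins=[0, 70, 80, 90, 100],
    labels=["<=70%", "70-80%", "80-90%", "90-100%"],
)

print(flights)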

Periodic snapshot fact table with large dimensions

I have been asked to model a star diagram.
I have 3 dimensions:
Date (day, month, year, week, quarter, ...)
Place (500 distinct values)
Product (80k different products)
The main question is how many items (products) are stored at the end of a day in every place.
After some study time with regard to dimensional modeling, I think I should implement a periodic snapshot table. However, reading through the Kimball docs, I noticed that a periodic snapshot demands an entry for every combination of the dimensions. This means I should add 40M rows every day (80k * 500).
Knowing that the products are (really) slow movers and that many places store zero products during long periods, this sounds like extreme overkill.
FYI the transactions in the source DB are 150k rows after three years.
So should I really add 40M rows every day, or could I just add the non-empty stores with their products specified? Also if for whatever reason one day all stores are empty, should I make an entry for that day (with dimensions N/A for store and product)?
You modeled correctly. It depends on the specifications, but normally you store only the products that are present in a location (you do not store zeroes), which could yield a number substantially lower than the maximum 80k.
If you want to further reduce your numbers, you could store only the last N daily snapshots and move older data into a "cold" table: for example, keep the last 10 daily snapshots in the main "hot" fact table and only monthly snapshots beyond that.
Do not exclude the possibility of calculating the snapshot on the fly in the reporting system; depending on your environment it could be easy (in MDX or DAX, for example, it is). Mixed solutions are also possible (e.g. only the last month calculated on the fly).
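As a small sketch of the "store only non-empty combinations" idea, an end-of-day snapshot can be derived from the transactions roughly like this (pandas, with invented table and column names; days without any movement would carry the previous value forward and are omitted here for brevity):

import pandas as pd

# Toy stock movements: receipts are positive, issues are negative.
transactions = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]),
    "place_id": [1, 1, 1],
    "product_id": [10, 11, 10],
    "qty_change": [5, 3, -5],
})

# Net movement per (place, product, day), then a running on-hand quantity.
daily = (
    transactions
    .groupby(["place_id", "product_id", "date"], as_index=False)["qty_change"]
    .sum()
    .sort_values("date")
)
daily["on_hand"] = daily.groupby(["place_id", "product_id"])["qty_change"].cumsum()

# Snapshot rows: only combinations that actually hold stock are kept,
# instead of the full 500 places x 80k products grid. Product 10 drops
# out on 2024-01-02 because its stock went back to zero.
snapshot = daily.loc[daily["on_hand"] > 0, ["date", "place_id", "product_id", "on_hand"]]
print(snapshot)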

InfluxDB performance

For my case, I need to capture 15 performance metrics for devices and save them to InfluxDB. Each device has a unique device id.
Metrics are written into InfluxDB in the following way (here I only show one as an example):
new Serie.Builder("perfmetric1")
        .columns("time", "value", "id", "type")
        .values(getTime(), getPerf1(), getId(), getType())
        .build()
Writing data is fast and easy. But I see bad performance when I run a query. I'm trying to get all 15 metric values for the last hour:
select value from perfmetric1, perfmetric2, ..., perfmetric15
where id='testdeviceid' and time > now() - 1h
For one hour, each metric has 120 data points, so in total that's 1,800 data points. The query takes about 5 seconds on a c4.4xlarge EC2 instance when it's idle.
I believe InfluxDB can do better. Is this a problem of my schema design, or is it something else? Would splitting the query into 15 parallel calls go faster?
As @valentin's answer says, you need to build an index for the id column for InfluxDB to perform these queries efficiently.
In 0.8 stable you can do this "indexing" using continuous fanout queries. For example, the following continuous query will expand your perfmetric1 series into multiple series of the form perfmetric1.id:
select * from perfmetric1 into perfmetric1.[id];
Later you would do:
select value from perfmetric1.testdeviceid, perfmetric2.testdeviceid, ..., perfmetric15.testdeviceid where time > now() - 1h
This query will take much less time to complete since InfluxDB won't have to perform a full scan of the timeseries to get the points for each testdeviceid.
Build an index on the id column. It seems that the engine uses a full scan on the table to retrieve the data. By splitting your query into 15 threads, the engine will use 15 full scans and the performance will be much worse.
