Periodic snapshot fact table with large dimensions - data-warehouse

I have been asked to model a star schema.
I have 3 dimensions:
Date (day, month, year, week, quarter, ...)
Place (500 distinct values)
Product (80k different products)
The main question is how many items (products) are stored at the end of a day in every place.
After some study of dimensional modeling, I think I should implement a periodic snapshot fact table. However, reading through the Kimball docs, I noticed that a periodic snapshot calls for an entry for every combination of the dimensions. This means I would have to add 40M rows every day (80k * 500).
Knowing that the products are (really) slow movers and that many places hold zero products for long periods, this sounds like extreme overkill.
FYI, the transaction table in the source DB holds 150k rows after three years.
So should I really add 40M rows every day, or could I just add the non-empty stores with their products specified? Also, if for whatever reason all stores are empty one day, should I make an entry for that day (with N/A dimensions for store and product)?

You modeled it correctly. It depends on the specifications, but normally you store only the products that are actually present in a location (you do not store zeroes), which can yield row counts substantially lower than the 80k maximum.
If you want to reduce the volume further, you could keep only the last N daily snapshots "hot" and move older data to a "cold" table: for example, keep the last 10 daily snapshots in the main fact table and only monthly snapshots beyond that.
Do not exclude the possibility of calculating the snapshot on the fly in the reporting layer; depending on your environment this can be easy (in MDX or DAX, for example, it is). Mixed solutions are also possible (e.g. only the last month calculated on the fly).
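For illustration, a minimal load sketch for a daily snapshot that keeps only non-zero balances; the table and column names here (InventoryTransactions, FactDailyStockSnapshot, QuantityChange) are assumptions, not from the question:

INSERT INTO FactDailyStockSnapshot (DateKey, PlaceKey, ProductKey, QuantityOnHand)
SELECT 20240331, t.PlaceKey, t.ProductKey, SUM(t.QuantityChange)
FROM InventoryTransactions t                  -- all movements up to the snapshot date
WHERE t.TransactionDate <= DATE '2024-03-31'
GROUP BY t.PlaceKey, t.ProductKey
HAVING SUM(t.QuantityChange) <> 0;            -- skip empty place/product combinations

Run once per day with the snapshot date advanced; with slow movers this stays far below the theoretical 40M rows per day.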

Related

SELECT queries performance impact when the Clickhouse table is continuously populated with INSERT INTO

The ClickHouse table (MergeTree engine) is continuously populated with “INSERT INTO … FORMAT CSV” queries, starting from empty. The average input rate is 7000 rows per second, and the inserts arrive in batches of a few thousand rows. This has a severe performance impact when SELECT queries are executed concurrently. As described in the ClickHouse documentation, the system needs at most 10 minutes to merge the data parts of a specific table (re-index). But this never happens, because the table is continuously populated.
This is also evident in the file system: the table folder has thousands of sub-folders and the index is over-segmented. If the data ingestion stops, after a few minutes the table is fully merged and the number of sub-folders drops to a dozen.
To work around this weakness, the Buffer engine was used to buffer the table data ingestion for 10 minutes. Consequently, the buffer's maximum number of rows is on average 4,200,000.
The underlying table stays at most 10 minutes behind, since the buffer keeps the most recently ingested rows. The table does get merged, and behaves the same as when ingestion is paused for a few minutes.
But SELECT queries against the Buffer table, which combine data from the buffer and the underlying table, are severely slower.
From the above it appears that if the table is continuously populated, it never merges and the indexing suffers. Is there a way to avoid this weakness?
The number of sub-folders in the table data directory is not a very representative value.
Indeed, each sub-folder contains a data part consisting of sorted (indexed) rows. If several data parts are merged into a new, bigger one, a new sub-folder appears.
However, source data parts are not removed instantly after the merge. There is a <merge_tree> setting, old_parts_lifetime, defining a delay after which the parts are removed; by default it is set to 8 minutes. There is also a cleanup_delay_period setting defining how often a background cleaner checks for and removes outdated parts; it is 30 seconds by default.
So it is normal to have that many sub-folders for about 8 minutes and 30 seconds after the ingestion starts. If that is unacceptable to you, you can change these settings.
It only makes sense to check the number of active parts in a table (i.e. parts which have not been merged into a bigger one). To do so, you can run the following query: SELECT count() FROM system.parts WHERE database='db' AND table='table' AND active.
Moreover, ClickHouse does such checks internally: if the number of active parts in a partition is greater than parts_to_delay_insert=150, it will slow down INSERTs, and if it is greater than parts_to_throw_insert=300 it will abort insertions.
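For example, to keep an eye on active part counts across all tables of a database (the database name is just a placeholder), a query like this can be used:

SELECT table, count() AS active_parts
FROM system.parts
WHERE database = 'db' AND active
GROUP BY table
ORDER BY active_parts DESC

If active_parts stays well below parts_to_delay_insert, the continuous ingestion is keeping up with the background merges.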

Should PAX be in the Flight dimension or the FactSales table?

I need to build a data mart using Power Pivot for a duty-free shop at an airport.
The sales manager analyzes sales data by flight number and by PAX, the number of people per flight.
So I don't know where to put PAX: in DimFlight or FactSales. It is additive, right?
Please explain why and into which table I should put PAX. DimFlight might include airline, flight_no, date, PAX. A flight may also land at the airport more than once a day.
PAX is a fact describing a measurable value of a specific flight event. It should be in the fact table, not in the flight dimension. I would expect total capacity to be an attribute of the plane dimension associated with the flight event. (Flight number would likely be a degenerate dimension, as it doesn't really own any attributes.) The PAX itself, however, should be a measure in the fact table.
You can generate a junk dimension that has the banding mentioned by @Luis Leal to do some capacity analytics. You can even create a numbers dimension with an attribute for each grouping level so you can do more detailed banding, for example an attribute for 1s, 10s, 100s, 1000s, etc. You can also calculate the filled capacity of the flight and point to the numbers dimension so you can group flights by 80% full, 90% full, etc.
Nothing stops you from modeling it as both a dimension and a measure, so you can store it both in a dimension table and as a measure on a fact table. If you store it as a measure on the fact table, you can perform several analyses across the other dimensions and get insights such as averages, max, min, and totals by dimension x or y, which would be very difficult if you stored it only in the dimension table.
On the other hand, storing it in the dimension table enables additional "perspectives" of analysis; for example, a common approach is to store "interval" columns in the dimension table with values like "from 1 to 1000 pax", "from 1001 to 2000", calculated at ETL time from the value of PAX. So why not use both?
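A minimal sketch of the combined approach, with assumed names (DimPaxBand, the FactSales columns) rather than anything from the original model:

CREATE TABLE DimPaxBand (
    PaxBandKey INT PRIMARY KEY,
    BandLabel  VARCHAR(50)      -- e.g. '1-100 pax', '101-200 pax', filled at ETL time
);

CREATE TABLE FactSales (
    DateKey     INT,            -- FK to the date dimension
    FlightNo    VARCHAR(10),    -- degenerate dimension
    PaxBandKey  INT,            -- FK to DimPaxBand for banded analysis
    Pax         INT,            -- additive measure: people on the flight
    SalesAmount DECIMAL(12,2)
);

With Pax as a measure you get SUM/AVG/MIN/MAX by any dimension; with PaxBandKey you can slice sales by capacity band.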

Algorithm for tracking changes in value over time

I am writing a Rails app that deals with product inventory. I would like to include the following features, and am struggling to develop an efficient algorithm:
View stock history (how many were in stock on each date)
Quantity removed from warehouse, and quantity added to warehouse over specific periods of time
Amount of time the product was out of stock in any given period
My questions are as follows:
What is the best way of tracking changes? In addition to my Products table, should I create another table called HistoricProductQuantities, and insert a new record each time there is a change in the quantity?
What number should I track? The historic stock quantity (i.e. 50 in stock on this day, 24 in stock on that day), or the CHANGE in stock quantity, i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I track both in separate tables?
Thanks for your help.
First of all, I recommend implementing a date dimension in your application, since it seems like you will be doing a lot of time-related calculations. Search for "date dimension", as the details are beyond the scope of your questions; that said, I believe implementing and using one will be of great benefit to your app.
As far as your direct questions go:
What is the best way of tracking changes? In addition to my Products table, should I create another table called HistoricProductQuantities, and insert a new record each time there is a change in the quantity?
Yes, you could do this. I would probably call it HistoricProductSnapshot and track product activity in there on a daily basis. With this information, together with the date dimension, you could answer questions such as "how many of product X did we have 5 days ago, or a month ago?".
What number should I track? The historic stock quantity (i.e. 50 in stock on this day, 24 in stock on that day), or the CHANGE in stock quantity i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I track both in separate tables?
I do not have experience writing inventory control software, but I believe that with the snapshot table I mentioned in the question above you would only have to keep track of quantities per day. The change in product counts can then be calculated from the snapshot table. You could, for example, have a function that outputs the stock amounts for a given time range as an array, e.g. from March 1 to March 7 the stock amounts for product Y were [45, 40, 39, 27, 22, 45, 44].
Hope that helps. As I said, I am not a product inventory guy, but I have worked with point-of-sale systems, and the approach above should give you a good enough start for what you are trying to do.
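As a sketch of that idea, assuming a snapshot table named historic_product_snapshots(product_id, snapshot_date, quantity) (hypothetical names), the daily change can be derived with a window function:

SELECT product_id,
       snapshot_date,
       quantity,
       quantity - LAG(quantity) OVER (PARTITION BY product_id ORDER BY snapshot_date) AS change_vs_previous_day
FROM historic_product_snapshots
WHERE snapshot_date BETWEEN '2024-03-01' AND '2024-03-07'
ORDER BY product_id, snapshot_date;

LAG needs a database that supports window functions (e.g. PostgreSQL, or MySQL 8+).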
This gem could be useful for tracking changes in models: https://github.com/collectiveidea/audited
Keep the data raw. I would personally create a new data entry every day, recording how many items you have in stock that day. Or you can make the interval much shorter, such as every 12 hours.
For our particular use case:
We had a table called Days, which had a many-to-many relationship with products, and each "relationship" had a value called quantity (to keep track of the quantity of a product per day). Additionally, each relationship had a one-to-many relationship with transactions, holding the time of each transaction and the remaining stock.
I would personally advise you to use the quantity of stock as the raw data, as it lets you derive things like how many items were removed in a certain transaction, and when an item went out of stock and when it came back in stock, all from the data. When you need to perform statistical calculations on data, it's best to store it as raw values (the quantity of the item).
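Building on those raw daily quantities, a hedged example of deriving the "out of stock" time from the same data (table and column names are assumptions):

SELECT product_id,
       COUNT(*) AS days_out_of_stock
FROM daily_stock
WHERE quantity = 0
  AND stock_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY product_id;

Removals and additions over a period can be derived the same way by comparing consecutive days, as in the earlier sketch.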

Time and date dimension in data warehouse

I'm building a data warehouse. Each fact has its timestamp. I need to create reports by day, month, and quarter, but also by hour. Looking at the examples, I see that dates tend to be saved in dimension tables.
[example date dimension table; source: etl-tools.info]
But I think that it makes no sense for time: the dimension table would grow and grow. On the other hand, a JOIN with a date dimension table is more efficient than using date/time functions in SQL.
What are your opinions/solutions?
(I'm using Infobright)
Kimball recommends having separate time and date dimensions:
design-tip-51-latest-thinking-on-time-dimension-tables
In previous Toolkit books, we have recommended building such a dimension with the minutes or seconds component of time as an offset from midnight of each day, but we have come to realize that the resulting end user applications became too difficult, especially when trying to compute time spans. Also, unlike the calendar day dimension, there are very few descriptive attributes for the specific minute or second within a day. If the enterprise has well defined attributes for time slices within a day, such as shift names, or advertising time slots, an additional time-of-day dimension can be added to the design where this dimension is defined as the number of minutes (or even seconds) past midnight. Thus this time-of-day dimension would either have 1440 records if the grain were minutes or 86,400 records if the grain were seconds.
My guess is that it depends on your reporting requirement.
If you need something like
WHERE "Hour" = 10
meaning every day between 10:00:00 and 10:59:59, then I would use the time dimension, because it is faster than
WHERE date_part('hour', TimeStamp) = 10
because the date_part() function will be evaluated for every row.
You should still keep the TimeStamp in the fact table in order to aggregate over boundaries of days, like in:
WHERE TimeStamp between '2010-03-22 23:30' and '2010-03-23 11:15'
which gets awkward when using dimension fields.
Usually, a time dimension has minute resolution, so 1440 rows.
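A quick sketch of populating such a minute-grain time dimension (PostgreSQL-style, assuming a dim_time table with these three columns):

INSERT INTO dim_time (time_key, hour_of_day, minute_of_hour)
SELECT n,                 -- 0..1439, minutes past midnight
       n / 60,            -- hour 0..23
       n % 60             -- minute 0..59
FROM generate_series(0, 1439) AS n;

The fact table then stores time_key as minutes past midnight, and "Hour" = 10 becomes a plain equality filter on hour_of_day.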
Time should be a dimension in data warehouses, since you will frequently want to aggregate over it. You could use a snowflake schema to reduce the overhead. In general, as I pointed out in my comment, hours seem like an unusually fine resolution. If you insist on them, making the hour of the day a separate dimension might help, but I cannot tell you whether this is good design.
I would recommend having separate dimensions for date and time. The date dimension would have one record for each date within an identified valid range of dates, for example 01/01/1980 to 12/31/2025.
And a separate dimension for time, with 86,400 records, one per second, each identified by a time key.
In the fact records, where you need both date and time, add both keys as references to these conformed dimensions.
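A sketch of what that can look like, with assumed table and column names (DimTime keyed by seconds past midnight, with an HourOfDay attribute):

CREATE TABLE FactEvent (
    DateKey  INT,          -- FK to the date dimension (one row per day)
    TimeKey  INT,          -- FK to DimTime (seconds past midnight, 0..86399)
    EventTS  TIMESTAMP,    -- raw timestamp kept for ranges that cross day boundaries
    Amount   DECIMAL(12,2)
);

SELECT SUM(f.Amount)
FROM FactEvent f
JOIN DimTime t ON t.TimeKey = f.TimeKey
WHERE t.HourOfDay = 10;    -- every day between 10:00:00 and 10:59:59

Keeping the raw timestamp alongside the two keys preserves the ability to filter across day boundaries, as noted above.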

Data warehouse reporting questions

I've just begun diving into data warehousing and I have one question that I just can't seem to figure out.
I have a business which has ten stores, each with a certain number of employees. In my data warehouse I have a dimension representing the store. The employee dimension is an SCD, with columns for the start/end dates and the store at which the employee is working.
My fact table is based on suggestions the employees give (anonymously) to the store managers. This table contains the suggestion type (cleanliness, salary issue, etc.), the date it was submitted (a foreign key to a Time dimension table), and the store at which it was submitted.
What I want to do is create a report showing the ratio of the number of suggestions to the number of employees in a given year. Because the number of employees changes periodically I just can't do a simple query for the total number of employees.
Unfortunately I've searched the web quite a bit trying to find a solution but the majority of the examples are retail based sales, which is different from what I'm trying to do.
Any help would be appreciated. I do have the AdventureWorksDW installed on my machine so I can use that as a point of reference if anyone offers a suggestion using that.
Thanks in advance!
The slowly changing dimension should have a natural key that identifies the source of the row (otherwise how would it know what to compare to detect changes?). This should be constant across all iterations of the dimension row. You can get a count of employees by computing a distinct count of the natural key.
Edit: If your transaction table (suggestions) has a date on it, a distinct count of employees grouped by a computed function of the suggestion date (e.g. datepart(yy, s.SuggestionDate)) and the business unit should do it. You don't need to worry about the dates on the employee dimension, as the applicable row should join directly to the transaction table.
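A hedged illustration of that edit, assuming the suggestion fact actually carries an employee key and using made-up names (FactSuggestion, StoreKey, EmployeeNaturalKey):

SELECT DATEPART(yy, s.SuggestionDate) AS SuggestionYear,
       s.StoreKey,
       COUNT(*) AS Suggestions,
       COUNT(DISTINCT e.EmployeeNaturalKey) AS Employees
FROM FactSuggestion s
JOIN DimEmployee e ON e.EmployeeKey = s.EmployeeKey
GROUP BY DATEPART(yy, s.SuggestionDate), s.StoreKey;

If the suggestions are fully anonymous (no employee key on the fact row), the headcount has to come from the employee dimension or a separate headcount fact instead, as in the next answer.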
Add another fact table for the number of employees in each store for each month; you could use the maximum headcount for the month. Then average the months over the year and use this as the "number of employees in a year".
Load the new fact table at the end of each month. It would look like:
fact table: EmployeeCount
KeyEmployeeCount int -- surrogate key
KeyDate int -- FK to date dimension, points to the last day of the month
KeyStore int -- FK to store dimension
NumberOfEmployees int -- (max) number of employees for the month in the given store
If you need a finer resolution, use "per week" or even "per day". The main idea is to average the NumberOfEmployees measure for a given store over the year.
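Putting it together, a sketch of the final ratio query; the EmployeeCount columns follow the table above, while FactSuggestion (and its key columns), DimDate and CalendarYear are assumed names:

WITH yearly_headcount AS (
    SELECT ec.KeyStore, d.CalendarYear,
           AVG(ec.NumberOfEmployees * 1.0) AS AvgEmployees
    FROM EmployeeCount ec
    JOIN DimDate d ON d.DateKey = ec.KeyDate
    GROUP BY ec.KeyStore, d.CalendarYear
),
yearly_suggestions AS (
    SELECT s.KeyStore, d.CalendarYear,
           COUNT(*) AS Suggestions
    FROM FactSuggestion s
    JOIN DimDate d ON d.DateKey = s.KeyDate
    GROUP BY s.KeyStore, d.CalendarYear
)
SELECT h.KeyStore, h.CalendarYear,
       s.Suggestions / h.AvgEmployees AS SuggestionsPerEmployee
FROM yearly_headcount h
JOIN yearly_suggestions s
  ON s.KeyStore = h.KeyStore AND s.CalendarYear = h.CalendarYear;

The monthly (max) headcounts are averaged over the year per store, and the yearly suggestion count is divided by that average.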
