Merging data from dimensions of different grains - data-warehouse

I am using an RDBMS and writing an equivalent of an ETL program, though without tools like Informatica etc.
The source data comes from three different tables, each at a different grain. It's all per account, but the SCD type 2 behavior makes the number of rows variable. Also, one of the sources has data on a "per day" basis, i.e. one row per day.
The need is to merge this data (5 attributes in the end) into a single table to facilitate lookup on account number and date, effectively providing a lookup service to the extent of "what's the value of this attribute on this day" for a given account.
The challenge is primarily leveling the "grain" of the records. I have a couple of brute-force ideas. One is to explode the variable-grain rows and generate additional rows to level with the lowest-grain attribute, effectively having one row per day for each attribute. This doesn't look clean to me and will consume much more storage too. Here's an example:
Source -
table 1 (Customer Details)
Address - dt1 - dt2 - val1
Address - dt2+1 - dt3 - val2
Address - dt3+1 - infinity - val3
table 2 (Loan Details)
Maturity Date - dt4 - dt5 - val8
Maturity Date - dt5+1 - dt6 - val-x
Maturity Date - dt6+1 - dt7 - val-xx
Maturity Date - dt7+1 - infinity - val-y
table 3 (Account Balance) (one record per day)
Daily Interest Accrued - dt1 - val1
Daily Interest Accrued - dt2 - val2
Daily Interest Accrued - dt3 - val3
Target
dt1 - Address-val - Maturity Date-val - Daily Interest Accrued-val
dt2 - Address-val - Maturity Date-val - Daily Interest Accrued-val
dt3 - Address-val - Maturity Date-val - Daily Interest Accrued-val
These three attributes need to be stored in a single table... ideas please..

Let's deal with the modelling, first.
In the absence of any more details, I think you've got a periodic snapshot for ACCOUNT-BALANCE, a DATE dimension, a Type-2 slowly changing dimension for CUSTOMER, and a Type-1 dimension for LOAN.
The attributes of FACT_ACCOUNT look like DIM_DATE_ID, DIM_CUSTOMER_ID, DIM_LOAN_ID, ACCRUED_INTEREST, ACCOUNT_BALANCE.
The reason I don't think you've got a Type-2 dimension for LOAN is because I don't see its values - such as account number, maturity date and original balance - changing over time.
By moving the ACCOUNT_BALANCE from the dimension to the fact table you've got a better representation of the process.
One question which will probably arise is the storage of an interest rate. A fixed rate would be an attribute of an SCD1 dimension. A periodically changing rate could be an SCD2 value. If it changed at the same grain as the fact table (i.e. daily), I'd make it a non-additive measure on the fact table.
I did see your point about storing three attributes in one table, but I don't see its purpose. If the attributes that are needed to satisfy a query are in different tables, that is the role of a JOIN. Any competent visualisation or analytics tool is going to support simple joins like this.
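To make the "that is the role of a JOIN" point concrete, here is a minimal sketch of a lookup by account number and date that joins the daily rows to the date-ranged rows instead of exploding them. It uses Python's built-in sqlite3 module purely for illustration; the table names, column names, sample values and the `9999-12-31` stand-in for "infinity" are all assumptions, not the asker's actual schema.

```python
import sqlite3

# Hypothetical schema: two date-ranged (SCD2-style) tables and one daily table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_address (account_no TEXT, valid_from TEXT, valid_to TEXT, address TEXT);
CREATE TABLE loan_maturity    (account_no TEXT, valid_from TEXT, valid_to TEXT, maturity_date TEXT);
CREATE TABLE daily_interest   (account_no TEXT, as_of_date TEXT, interest_accrued REAL);

INSERT INTO customer_address VALUES
  ('A1', '2024-01-01', '2024-01-02', 'addr-1'),
  ('A1', '2024-01-03', '9999-12-31', 'addr-2');
INSERT INTO loan_maturity VALUES
  ('A1', '2024-01-01', '9999-12-31', '2030-06-30');
INSERT INTO daily_interest VALUES
  ('A1', '2024-01-01', 1.10),
  ('A1', '2024-01-02', 1.20),
  ('A1', '2024-01-03', 1.30);
""")

# ISO-8601 date strings sort chronologically, so BETWEEN works on plain text here.
LOOKUP_SQL = """
SELECT d.as_of_date, c.address, l.maturity_date, d.interest_accrued
FROM daily_interest d
JOIN customer_address c
  ON c.account_no = d.account_no
 AND d.as_of_date BETWEEN c.valid_from AND c.valid_to
JOIN loan_maturity l
  ON l.account_no = d.account_no
 AND d.as_of_date BETWEEN l.valid_from AND l.valid_to
WHERE d.account_no = ? AND d.as_of_date = ?
"""

def lookup(account_no, as_of_date):
    """Return (date, address, maturity_date, interest) for one account and day, or None."""
    return conn.execute(LOOKUP_SQL, (account_no, as_of_date)).fetchone()
```

Calling `lookup('A1', '2024-01-03')` picks up whichever address row covers that day, so the ranged tables keep one row per change rather than one row per day.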

Related

Should PAX be in the Flight dimension or the Fact Sales table?

I need to build a data mart using Power Pivot for a duty-free shop at an airport.
The sales manager analyzes sales data by flight number and by PAX (the number of people per flight).
So I don't know where to put PAX: in DimFlight or in FactSales. It is additive, right?
Please explain which table I should put PAX in, and why and how. DimFlight may include airline, flight_no, date, PAX. A flight may also land at the airport more than once a day.
PAX is a fact describing a measurable value of a specific flight event. It should be in the fact table, not in the flight dimension. I would expect total capacity to be an attribute of the plane dimension associated with the flight event. (Flight number would likely be a degenerate dimension, as it doesn't really own any attributes.) However, the PAX itself should be a measure in the fact table.
You can generate a junk dimension that has the banding mentioned by @Luis Leal to do some capacity analytics. You can even create a numbers dimension with an attribute for each group level so you can do more detailed banding, for example an attribute for 1s, 10s, 100s, 1000s, etc. You can also calculate the filled capacity of the flight and point to the numbers dimension so you can group flights by 80% full, 90% full, etc.
Nothing stops you from modeling it as both a dimension attribute and a measure, so you can store it both in a dimension table and as a measure in a fact table. If you store it as a measure in the fact table, you can analyze it by the other dimensions and get insights such as average, max, min and total by dimension x or y, which would be very difficult if you stored it only in the dimension table.
On the other hand, storing it in the dimension table enables additional "perspectives" of analysis. For example, a common approach is to store "interval" columns in the dimension table with values like "from 1 to 1000 pax", "from 1001 to 2000".
This column is calculated at ETL time from the value of PAX. So why not use both?
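As an illustration of the banding both answers describe, here is a small sketch of how an ETL step might derive the interval label and a "percent full" band. The function names, the band width of 1000 and the 10% step are assumptions chosen to match the examples in the answers, not part of any specific tool.

```python
def pax_band(pax, width=1000):
    """Label the interval a passenger count falls into, e.g. 'from 1 to 1000 pax'."""
    if pax < 1:
        raise ValueError("pax must be at least 1")
    lower = ((pax - 1) // width) * width + 1
    upper = lower + width - 1
    return f"from {lower} to {upper} pax"

def load_factor_band(pax, capacity, step=10):
    """Round the filled capacity down to the nearest `step` percent, e.g. '80% full'."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    percent = 100 * pax // capacity      # integer percent, rounded down
    return f"{percent // step * step}% full"
```

For example, `pax_band(1001)` gives `'from 1001 to 2000 pax'`, and a flight with 172 passengers on a 200-seat plane lands in the `'80% full'` band. The raw PAX value would still be kept as a measure on the fact row.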

Algorithm for tracking changes in value over time

I am writing a rails app that deals with product inventory. I would like to include the following features, and am struggling with developing an efficient algorithm:
View stock history (how many were in stock on each date)
Quantity removed from warehouse, and quantity added to warehouse over specific periods of time
Amount of time the product was out of stock in any given period
My questions are as follows:
What is the best way of tracking changes? In addition to my Products table, should I create another table called HistoricProductQuantities, and insert a new record each time there is a change in the quantity?
What number should I track? The historic stock quantity (i.e. 50 in stock on this day, 24 in stock on that day), or the CHANGE in stock quantity i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I track both in separate tables?
Thanks for your help.
First of all I recommend implementing Date Dimensions on your application, as it seems like you will be doing a lot of Time related calculations. Search on Google for date dimensions as it's beyond the scope of your questions. That said, I believe it will be of great benefit for your app to implement and use date dimensions.
As far as your direct questions go:
What is the best way of tracking changes? In addition to my Products table, should I create another table called HistoricProductQuantities, and insert a new record each time there is a change in the quantity?
Yes, you could do this. I would probably call it HistoricProductSnapshot and keep track of the product activity in there on a daily basis. With this information, as well as date dimensions, you could do calculations such as "how many of Product X did we have 5 days ago, or a month ago?"
What number should I track? The historic stock quantity (i.e. 50 in stock on this day, 24 in stock on that day), or the CHANGE in stock quantity i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I track both in separate tables?
I do not have experience writing inventory control software, but I believe that with the snapshot table I mentioned in the answer above you would only have to keep track of quantities per day. The change in product counts could then be calculated from your snapshot table. You could, for example, have a function that outputs the product amounts in a given time range as an array. Example: from March 1 to March 7 these were the stock amounts for Product Y: [45,40,39,27,22,45,44].
Hope that helps. As I said, I am not a product inventory guy, but I have worked with point-of-sale systems and the procedure above should give you a good enough start for what you are trying to do.
This gem could be useful for tracking changes in models: https://github.com/collectiveidea/audited
Keep the data raw. I would personally create a new data entry every day, recording how many items you have in stock that day. Or you can make the interval much shorter, such as every 12 hours.
For our particular use case:
We had a table called Days, which had a many to many relationship with products, and each "relationship" will have a value called quantity (to keep track of quantity of product per day). Additionally per relationship, we had another value for the relationship with transactions (a one to many relationship) that has the entries for the time of transaction and remaining stocks.
I would personally advise you to use the quantity of stock as the raw data, as it will let you derive things such as how many items were removed during a certain transaction, when the item went out of stock and when it came back in stock, all from the data. When you have data on which you need to perform statistical calculations, it's best to store it as raw values (the quantity of the item).
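To show how the changes can be derived from stored quantities, here is a minimal sketch that works on the March 1 to March 7 array from the answer above. The function names are made up for illustration. Note one limitation that follows from the arithmetic: a daily snapshot only sees the net movement per day, so stock that is both added and removed on the same day cancels out unless individual transactions are recorded as well.

```python
def daily_changes(quantities):
    """Day-over-day change in stock: negative means removed, positive means added."""
    return [today - yesterday for yesterday, today in zip(quantities, quantities[1:])]

def removed_and_added(quantities):
    """Net quantity removed and net quantity added over the period covered."""
    changes = daily_changes(quantities)
    removed = -sum(c for c in changes if c < 0)
    added = sum(c for c in changes if c > 0)
    return removed, added

def out_of_stock_days(quantities):
    """Number of snapshot days on which nothing was in stock."""
    return sum(1 for q in quantities if q == 0)

stock = [45, 40, 39, 27, 22, 45, 44]   # daily snapshot for Product Y
changes = daily_changes(stock)          # [-5, -1, -12, -5, 23, -1]
removed, added = removed_and_added(stock)
```

With this data, 24 units were removed and 23 added over the week, and the product was never out of stock.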

Calculate running total rails 3

What's the best way - or, indeed, any way - to calculate a running total in Rails?
I have a model, Sale. It has a quantity column and a sales_value column. I need to populate a third column, total_quantity, with the sum of the quantity values of the previous records, when the table is sorted by isbn_id, then channel_id, then invoice_date. This sets all sorts of sensible database management alarm bells ringing, so I'm wondering if it's even possible.
The reason for needing this cumulative sum is to apply a percentage to the sales where the cumulative quantity is within a particular range. I can't use an average sales value across all records, because the margin on sales can vary dramatically over time - so I'd apply an average to a bunch of sales which might over or under pay the royalty payee.
So. Should I do a before_save callback on the Sale model, and update_attribute, somehow? Is there a method to return the value of the previous record when the table is sorted in a particular way? Or should I dump all Sale records into an array and maybe use inject to accumulate the running total?
Any ideas most welcome, thanks in advance.
Update: subsequent question asked here.
Do not use inject (srsly). The best way to do this is to use the SQL group commands and/or the Calculations methods in ActiveRecord (like sum):
http://ar.rubyonrails.org/classes/ActiveRecord/Calculations/ClassMethods.html
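One way to let the database do the accumulation is a SQL window function, sketched below with Python's built-in sqlite3 module so it can be run standalone. This is a different technique from the grouping and `sum` calls the answer names, and it needs a database that supports window functions (SQLite 3.25 or later here). The assumptions are that the running total restarts for each isbn_id and channel_id pair, that it includes the current row, and that the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, isbn_id INTEGER, "
    "channel_id INTEGER, invoice_date TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO sales (isbn_id, channel_id, invoice_date, quantity) VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2012-01-05", 10),
        (1, 1, "2012-02-05", 5),
        (1, 2, "2012-01-10", 7),
        (2, 1, "2012-01-01", 3),
        (1, 1, "2012-03-05", 20),
    ],
)

# Running total per (isbn_id, channel_id), ordered by invoice_date.
# id is used as a tie-breaker so rows with the same date get distinct totals.
RUNNING_TOTAL_SQL = """
SELECT isbn_id, channel_id, invoice_date, quantity,
       SUM(quantity) OVER (
           PARTITION BY isbn_id, channel_id
           ORDER BY invoice_date, id
       ) AS total_quantity
FROM sales
ORDER BY isbn_id, channel_id, invoice_date, id
"""

rows = conn.execute(RUNNING_TOTAL_SQL).fetchall()
running_totals = [row[4] for row in rows]
```

Because the total is computed at query time, nothing has to be kept in sync by a before_save callback when an older sale is inserted or corrected.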

Availability Scheme in database

I'm currently designing a website which can help my rowing team plan training times and such. The basic idea is that every rower can set the times they can train. Coaches can then see the availability of all the rowers in a handy table and can use this to plan a training.
My question is, how should I represent availability in the class diagram and database?
The idea that I had was to divide days into time blocks: block 1 stands for 7:00 - 7:30, block 2 stands for 7:30 - 8:00, and so on. Then I will create a table 'timeblocks' with the following attributes:
block_id
user_id
date (day, month and year)
block_number
availability
Is this an efficient way of storing availability data?
Another way is to normalize this table into two pieces: a separate block table and an availability table.
block:
Block_id
block_range
Time_Block:
Time_blockId
Block_ID
user_ID
Date
Availability
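The mapping between a block number and its time range (the block_range idea) can be sketched as follows, assuming the scheme from the question where block 1 starts at 7:00 and every block lasts 30 minutes. The constant and function names are hypothetical.

```python
from datetime import datetime, time, timedelta

DAY_START = time(7, 0)   # block 1 starts at 7:00, as in the question
BLOCK_MINUTES = 30

def block_range(block_number):
    """Return (start, end) times for a 1-based block number."""
    if block_number < 1:
        raise ValueError("block numbers start at 1")
    # The calendar date is arbitrary; it is only needed for time arithmetic.
    base = datetime.combine(datetime(2000, 1, 1), DAY_START)
    start = base + timedelta(minutes=BLOCK_MINUTES * (block_number - 1))
    end = start + timedelta(minutes=BLOCK_MINUTES)
    return start.time(), end.time()

def block_for(moment):
    """Return the 1-based block number that contains a given time of day."""
    minutes = (moment.hour * 60 + moment.minute) - (DAY_START.hour * 60 + DAY_START.minute)
    if minutes < 0:
        raise ValueError("time is before the first block of the day")
    return minutes // BLOCK_MINUTES + 1
```

If the ranges are always computable like this, the block table is mostly a convenience for joins and display; it becomes more useful if block lengths or start times ever vary.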

Data warehouse reporting questions

I've just begun diving into data warehousing and I have one question that I just can't seem to figure out.
I have a business which has ten stores, each with a certain number of employees. In my data warehouse I have a dimension representing the store. The employee dimension is an SCD, with columns for start/end dates and the store at which the employee is working.
My fact table is based on suggestions the employees give (anonymously) to the store managers. This table contains the suggestion type (cleanliness, salary issue, etc), the date it was submitted (foreign keyed to a Time dimension table), and the store at which it was submitted.
What I want to do is create a report showing the ratio of the number of suggestions to the number of employees in a given year. Because the number of employees changes periodically I just can't do a simple query for the total number of employees.
Unfortunately I've searched the web quite a bit trying to find a solution but the majority of the examples are retail based sales, which is different from what I'm trying to do.
Any help would be appreciated. I do have the AdventureWorksDW installed on my machine so I can use that as a point of reference if anyone offers a suggestion using that.
Thanks in advance!
The slowly changing dimension should have a natural key that identifies the source of the row (otherwise, how would it know what to compare to detect changes?). This should be constant across all versions of the dimension row. You can get a count of employees by computing a distinct count of the natural key.
Edit: If your transaction table (suggestion) has a date on it, a distinct count of employees grouped by a computed function of the suggestion date (e.g. datepart (yy, s.SuggestionDate)) and the business unit should do it. You don't need to worry about the date on the employee dimension as the applicable row should join directly to the transaction table.
Add another fact table for the number of employees in each store for each month; you could use the max number for the month. Then average the months for the year and use this as the "number of employees in a year".
Load your new fact table at the end of each month. The new table would look like:
fact table: EmployeeCount
KeyEmployeeCount int -- surrogate key
KeyDate int -- FK to date dimension, point to last day of a month
KeyStore int -- FK to store dimension
NumberOfEmployes int -- (max) number of employees for the month in a given store
If you need a finer resolution, use "per week" or even "per day". The main idea is to average the NumberOfEmployes measure for a given store over the year.
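The averaging step can be sketched in a few lines. This assumes the monthly rows of the EmployeeCount fact table have already been read into tuples and the suggestions have already been counted per store and year; the function name, input shapes and sample numbers are all made up for illustration.

```python
from collections import defaultdict

def suggestions_per_employee(monthly_headcount, suggestion_counts):
    """Ratio of suggestions to average monthly headcount, per (store, year).

    monthly_headcount: iterable of (store, year, month, number_of_employees)
    suggestion_counts: dict mapping (store, year) to the number of suggestions
    """
    by_store_year = defaultdict(list)
    for store, year, _month, employees in monthly_headcount:
        by_store_year[(store, year)].append(employees)

    ratios = {}
    for key, counts in by_store_year.items():
        average = sum(counts) / len(counts)
        if average > 0:                      # skip stores with no staff at all
            ratios[key] = suggestion_counts.get(key, 0) / average
    return ratios

# Made-up sample: four months of headcount for one store.
headcount = [("S1", 2023, 1, 10), ("S1", 2023, 2, 12),
             ("S1", 2023, 3, 12), ("S1", 2023, 4, 14)]
ratios = suggestions_per_employee(headcount, {("S1", 2023): 30})
```

Here the average headcount is 12, so 30 suggestions give a ratio of 2.5 suggestions per employee for that store and year.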

Resources