Is there a concept of slowly changing FACT in data warehouse?

In data warehousing, we have the concept of slowly changing dimensions. I am just wondering why there is no jargon for 'slowly/rapidly changing FACTs' because the same Type1, Type 2 measures can be used to track changes in the FACT table.

According to the DW gods there are 3 types of FACT tables:
Transaction: your basic measurements with dim references. Measurements are not rolled up or summarized, and there are lots of DIM relations.
Periodic: rolled-up summaries of transaction fact tables over a defined period of time.
Accumulating Snapshot: measurements associated with two or more defined time periods.
From these we have at least 2 options that will result in something pretty similar to a slowly changing fact table. It all depends on how your source system is set up.
Option 1: Transactional based Source System
If your source system tracks changes to measurements via a series of transactions (i.e. initial value + change in value + change in value, etc.) then each of these transactions ends up in the transactional fact. This is then used by the periodic fact table to reflect the 'as of period' measures.
For example, if your source system tracks money in and out of an account, you would probably have a transaction fact table that pretty much mirrors the source money in/out table. You would also have a periodic fact table that is updated every period (in this case, every month) to reflect the total value of the account for that period.
The periodic fact table is your Slowly Changing Fact table.
Source               DW_Transaction_Fact      DW_Periodic_Fact
---------------  ->  -------------------  ->  --------------------
Acnt1 Jan +10$       Acnt1 Jan +10$           Acnt1 Jan 10$
Acnt1 Feb  -1$       Acnt1 Feb  -1$           Acnt1 Feb  9$
Acnt1 Apr  +2$       Acnt1 Apr  +2$           Acnt1 Mar  9$
                                              Acnt1 Apr 11$
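As a rough sketch of how the periodic fact in the table above could be derived from the transaction fact, here is a small Python example. The row layout and the function name build_periodic_fact are made up for illustration; a real warehouse would do this in the ETL layer or in SQL. The point is only that months with no transactions (March here) still get a row carrying the previous balance forward.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr"]

# Transaction fact rows: (account, month, signed change in value).
transaction_fact = [
    ("Acnt1", "Jan", +10),
    ("Acnt1", "Feb", -1),
    ("Acnt1", "Apr", +2),
]


def build_periodic_fact(transactions, months):
    """Return one (account, month, balance) row per account per month."""
    accounts = sorted({account for account, _, _ in transactions})
    rows = []
    for account in accounts:
        balance = 0
        for month in months:
            # Add every change recorded for this account in this month.
            balance += sum(
                change
                for acct, mon, change in transactions
                if acct == account and mon == month
            )
            # A row is written even when nothing changed (carry forward).
            rows.append((account, month, balance))
    return rows


periodic_fact = build_periodic_fact(transaction_fact, MONTHS)
for row in periodic_fact:
    print(row)
```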
Option 2: CRUD/Overwriting Source System
It's more likely you have a source system that lets users directly update/replace the business measurements. At any point in time, according to the source system, there was and is only one value for each measure. You can make this transactional with some clever trickery in your ETL process, but you're only ever going to get a transaction window limited by your ETL schedule.
In this case you could go either with a Periodic Fact table OR an Accumulating fact table.
Let's stick with our account example, but instead of transactions the table just stores an amount value against each account. This is updated as required in the source system so that for Acnt1, in January it was 10$, February 9$ and April 11$.
Sticking with the transaction and periodic fact tables, we would end up with this data (as at end of April). Again, the periodic fact table is your Slowly Changing Fact table.
DW_Transaction_Fact      DW_Periodic_Fact
-------------------  ->  --------------------
Acnt1 11$                Acnt1 Jan 10$
                         Acnt1 Feb  9$
                         Acnt1 Mar  9$
                         Acnt1 Apr 11$
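To make the snapshot idea concrete, here is a minimal Python sketch, assuming the ETL simply copies whatever the source currently holds at the end of each period. The dictionaries and the take_snapshot function are hypothetical stand-ins for the source table and the periodic fact load.

```python
# Current state of the overwriting source system: one value per account.
source = {"Acnt1": 10}

# Periodic fact: (account, period) -> value captured at the end of that period.
periodic_fact = {}


def take_snapshot(period):
    """Copy every current source value into the periodic fact for `period`."""
    for account, value in source.items():
        periodic_fact[(account, period)] = value


take_snapshot("Jan")    # source holds 10
source["Acnt1"] = 9     # the value is overwritten in February
take_snapshot("Feb")
take_snapshot("Mar")    # nothing changed, so 9 is captured again
source["Acnt1"] = 11    # overwritten again in April
take_snapshot("Apr")

for key, value in periodic_fact.items():
    print(key, value)
```

Any overwrite that happens and is reverted between two snapshots is never seen, which is the ETL-schedule limitation mentioned above.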
But we could also go with an Accumulating Fact table which could contain all month values for a given year.
DW_Accumulative_Fact_CrossTab
Year  Acnt   Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
2001  Acnt1   10    9    9   11    -    -    -    -    -    -    -    -
Or a more type3-ish version
DW_Accumulative_Fact_CrossTab
Acnt   Year  YearStartVal  CurrentVal
Acnt1  2001            10           9
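For what it's worth, the twelve-month crosstab layout (the first of the two accumulating tables above) can be produced from periodic rows with a simple pivot. This is only an illustrative Python sketch; the row layout and the to_crosstab function are assumptions, not part of any particular ETL tool.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Periodic fact rows as at end of April: (account, year, month, value).
periodic_fact = [
    ("Acnt1", 2001, "Jan", 10),
    ("Acnt1", 2001, "Feb", 9),
    ("Acnt1", 2001, "Mar", 9),
    ("Acnt1", 2001, "Apr", 11),
]


def to_crosstab(rows):
    """Return {(year, account): {month: value or None}} with all 12 months."""
    crosstab = {}
    for account, year, month, value in rows:
        # Months not yet loaded stay None until their period arrives.
        row = crosstab.setdefault((year, account), dict.fromkeys(MONTHS))
        row[month] = value
    return crosstab


crosstab = to_crosstab(periodic_fact)
for (year, account), months in crosstab.items():
    cells = ["-" if months[m] is None else str(months[m]) for m in MONTHS]
    print(year, account, *cells)
```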
Kind of relevant
In my experience, this sort of question comes up in this common business scenario:
There is a Core Business System with a DATABASE.
Business Periodically Issues Reports that summarise values by time periods from Core Business System
Core Business System allows retrospective updating of data - This is handled by overwriting values.
Business demands to know why the January figures in the same report run in June no longer match the January figures from the report run in February.
Note that you are now dealing with FOUR sets of time (Initial period of report, measurement at date of initial period, current report period, measurement at current period) which will be hard enough for you to explain let alone your end users to understand.
Try to step back, explain to your end users which business measures change over time, listen to what results they want and build your facts accordingly. Note that you may end up with multiple fact tables for the same measure, that is OK and good.
Reference:
http://www.kimballgroup.com/2008/11/fact-tables/
http://www.zentut.com/data-warehouse/fact-table/

Related

SQLite- Converting from numbers to date. Digital Forensics

Good Afternoon,
I want to make my life easier by querying SQLite databases I find on mobile devices, as opposed to manually putting the values in MFT Stampede.
I have a database with 70 tables that I extracted from an iOS device. There is a table I have a particular interest in, which keeps a record of all images that have been stored on the iOS device in a particular directory. I ran the timestamp number "680204956.051849" in MFT and got the "MAC (CF) Absolute Time" of "Fri, 22 Jul 2022 17:49:16". I ran a query to extract all the dates:
SELECT
datetime(ZADDEDDATE, 'unixepoch')
FROM ZASSET
LIMIT 15;
For the same field I ran in MFT, I get "1991-07-22 17:49:16". The year is wrong, any idea how I can get the correct year?
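One likely explanation, going by the "MAC (CF) Absolute Time" label: Core Foundation absolute time counts seconds from 2001-01-01 00:00:00 UTC, while SQLite's 'unixepoch' modifier counts from 1970-01-01. The two epochs are 978307200 seconds apart, which matches the 31-year gap seen here. The sketch below uses Python's sqlite3 module with an in-memory table that is only a stand-in for the real ZASSET table; the constant name is made up, and it assumes the column really holds CF absolute time.

```python
import sqlite3

# Seconds between the Unix epoch (1970-01-01) and the Core Foundation
# epoch (2001-01-01), both in UTC.
CF_EPOCH_OFFSET = 978307200

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the real ZASSET table, with only the one column.
conn.execute("CREATE TABLE ZASSET (ZADDEDDATE REAL)")
conn.execute("INSERT INTO ZASSET VALUES (680204956.051849)")

wrong, fixed = conn.execute(
    """
    SELECT datetime(ZADDEDDATE, 'unixepoch'),
           datetime(ZADDEDDATE + ?, 'unixepoch')
    FROM ZASSET
    LIMIT 15
    """,
    (CF_EPOCH_OFFSET,),
).fetchone()

print(wrong)   # the value interpreted as plain Unix time
print(fixed)   # the value shifted to the 2001 epoch first
```

In the original query this amounts to replacing ZADDEDDATE with ZADDEDDATE + 978307200 inside datetime().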

Fact Table Design - How to capture a fact which precedes the data start date

We have a fact table which collects information detailing when an employee selected a benefit. The problem we are trying to solve is how to count the total benefits selected by all employees.
We do have a BenefitSelectedOnDay flag and ordinarily, we can do a SUM on this to get a result, but this only works for benefit selections since we started loading the data.
For Example:
Suppose Client#1 has been using our analytics tool since October 2016. We have 4 months of data in the platform.
When the data is loaded in October, the Benefits source data will show:
Employee#1 selected a benefit on 4th April 2016.
Employee#2 selected a benefit on 3rd October 2016
Setting the BenefitSelectedOnDay flag for Employee#2 is very straightforward.
The issue is what to do with Employee#1 because we can’t set a flag on a day which doesn’t exist for that client in the fact table. Client#1's data will start on 1st October 2016.
Counting the benefit selection is problematic in some scenarios. If we’re filtering the report by date and only looking at benefit selections in Q4 2016, we have no problem. But, if we want a total benefit selection count, we have a problem because we haven’t set a flag for Employee#1 because the selection date precedes Client#1’s dataset range (Oct 1st 2016 - Jan 31st 2017 currently).
Two approaches seem logical in your scenario:
1. Load some historical data going back as far as the first benefit selection date that is still relevant to current reporting. While it may take some work and extra space, this may be your only solution if employees qualify for different benefits based on how long the benefit has been active.
2. Add records for a single day prior to the join date (Sept 30 in this case) and flag all benefits that were selected before and are active on the Client join date (Oct 1) as being selected on that date. They will fall outside of the October reporting window but count for unbounded queries. If benefits are a binary on/off thing this should work just fine.
Personally, I would go with option 1 unless the storage requirements are ridiculous. Even then, you could load only the flagged records into the fact table. Your client might get confused if he is able to select a period prior to the joining date and get broken data, but you can explain/justify that.
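A small Python sketch of the second approach (the single catch-all day before the join date), using the two employees from the question. The function name fact_date_for_selection is made up, and the sketch assumes every earlier selection is still active on the join date, which you would need to check against the source data.

```python
from datetime import date, timedelta


def fact_date_for_selection(selected_on, client_start):
    """Return the fact-table day on which to set BenefitSelectedOnDay.

    Selections made before the client's data starts are moved to the
    single day just before the start date.
    """
    day_before_start = client_start - timedelta(days=1)
    return max(selected_on, day_before_start)


client_start = date(2016, 10, 1)
selections = {
    "Employee#1": date(2016, 4, 4),
    "Employee#2": date(2016, 10, 3),
}

flag_days = {
    employee: fact_date_for_selection(selected_on, client_start)
    for employee, selected_on in selections.items()
}

# Q4 2016 report: only selections flagged inside the window are counted.
q4_count = sum(
    1 for day in flag_days.values()
    if date(2016, 10, 1) <= day <= date(2016, 12, 31)
)
# Unbounded query: every flagged row is counted.
total_count = len(flag_days)

print(flag_days, q4_count, total_count)
```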

How to show multiple Date/Times per location?

Using Google Spreadsheets, I need to enter data structured like the example below.
There will be multiple "quadrants"
Each "quadrant" can contain one or many "days",
Each "day" can contain one or many "times".
This data will ultimately be imported in some backend db (e.g. Access DB, SQL, MySQL).
Question: For each day, how do I represent multiple times? Do I create a new row?
Quadrant One Team Schedules
Sunday
10:00 AM - Red Team
3:00 PM - Green Team
Monday
6:00 AM - Red Team
10:00 AM - Yellow Team
3:30 PM - Green Team
Tuesday
Wednesday
6:00 PM - Yellow Team
Thursday
1:00 PM - Red Team
Friday
Saturday
10:00 AM - Blue Team
3:00 PM - Red Team
I’m not quite sure what answer you are expecting but wanting to post an image (and probably length!) is why this is not a comment.
Poor data layout that requires changes to help legibility or changes to facilitate further processing is, IMO, a very big issue – much more so than, it seems, is appreciated by novices (see perhaps Dunning-Kruger). Again merely my opinion, but I think about half of all questions on SO have data layout as an issue, in whole or part.
Some suggestions:
With databases, always have an index (ID) to identify unique records (rows). Often added automatically.
Try to ensure each record is complete for every field (nulls may cause issues). ID6 seems not required.
Use dates rather than days of the week (it is easier to get the day from the date than the date from the day!)
(Personal preference – not always viable) Use ‘scientific’ notation for dates (YYYYMMDD) to avoid ambiguity between ‘US’ and ‘UK’ systems – and the difficulties in switching between them.
Use the 24-hour clock (saves the space for AM and PM, reduces ambiguity and generally is easier to process).
Not so important nowadays but should consider codes (with a lookup table if desired) such as YL for Yellow rather than indeterminate length strings – saves on data storage so less cost, more speed win/win.
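To illustrate the suggestions, here is a rough Python sketch that flattens part of the schedule from the question into one row per quadrant/date/time, with an ID, YYYYMMDD dates, 24-hour times and short team codes. The week start date and all codes other than YL are invented for the example, since the question only gives days of the week and team names.

```python
from datetime import date, datetime, timedelta

# Hypothetical Sunday on which the scheduled week starts.
WEEK_START = date(2024, 1, 7)
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]
# Short codes for a lookup table; only YL comes from the suggestions above.
TEAM_CODES = {"Red Team": "RD", "Green Team": "GN",
              "Yellow Team": "YL", "Blue Team": "BL"}

# First four days of the "Quadrant One" schedule from the question.
schedule = {
    "Sunday": [("10:00 AM", "Red Team"), ("3:00 PM", "Green Team")],
    "Monday": [("6:00 AM", "Red Team"), ("10:00 AM", "Yellow Team"),
               ("3:30 PM", "Green Team")],
    "Tuesday": [],
    "Wednesday": [("6:00 PM", "Yellow Team")],
}


def flatten(quadrant, week_start, days):
    """Return rows of (id, quadrant, YYYYMMDD, HH:MM, team code)."""
    rows = []
    for offset, day_name in enumerate(DAYS):
        day = week_start + timedelta(days=offset)
        for clock, team in days.get(day_name, []):
            hhmm = datetime.strptime(clock, "%I:%M %p").strftime("%H:%M")
            rows.append((len(rows) + 1, quadrant,
                         day.strftime("%Y%m%d"), hhmm, TEAM_CODES[team]))
    return rows


rows = flatten("Quadrant One", WEEK_START, schedule)
for row in rows:
    print(row)
```

Each time slot becomes its own row, and a day with no slots (Tuesday here) simply produces no rows, so no record has empty fields.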

What is the best database schema for an availability calendar that allows scheduling appointments (reoccurring and single)?

In my application I have a provider that has a schedule and clients that book appointment from the schedule. I need the following features.
Provider:
- Be able to specify reoccurring availability. For example Mon 9-5, Tues 10-3, etc.
- Be able to black out dates. For example - not available on this Mon.
- Be able to add single, not reoccurring date/time slots. For example - This Sat 9-5.
Customer:
- Be able to book single appointments.
- Be able to book reoccurring appointments. (Every Mon 9-4).
So far I came up with 3 options:
Divide the schedule into 30 min intervals and create a database entry for each interval/provider pair. Each interval can be either free or booked. When a customer books an appointment we mark the intervals as booked. The problem with this approach is that it wastes a lot of space, and I am not sure how good the search performance would be for a reoccurring booking.
Save each availability period as an "event". If it is reoccurring, duplicate the event. When searching for free slots, search the booking table to make sure that there is no overlapping booking. In this case, searching for reoccurring slots seems a bit awkward. To find all the providers that are available on Mon 9-5 for the next year we will have to search for all the matching 'events' and find all the providers that have 52 matched events.
Save each availability period as an "event". Add a flag if it is reoccurring. When searching for free slots, search the booking table to make sure that there is no overlapping booking. It makes it easier to search for reoccurring appointments. To "black out" slots that are supposed to be reoccurring we can just insert a fake booking.
1. Create an event table:
a) With the basic columns eventdate, starttime, endtime, with other details for the event - these are the busy times so are what you block out on the calendar
b) Recurring Events - add columns:
- isrecurring - defaults to 0
- recurrencetype (daily, weekly, monthly)
- recurevery (a count of when the recurrence will occur)
- mon, tue, wed, thur, fri, sat, sun - days of the week for weekly recurrence
- month and dayofmonth - for monthly recurrence
2. The challenge comes when creating the recurring events on the calendar:
- if you create all of them at once (say for the next 6 months), whenever you edit one the others have to be updated
- If you only create an event when the previous one has passed then you need complex logic to display the calendars for future dates
3. You also need rules to take care of whether events are allowed to overlap each other, what resources are to be used, how far ahead the events can be scheduled
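One way around the "create them all at once vs. create them one at a time" problem is to store a single recurring row and expand it only when a calendar range is displayed. Below is a minimal Python sketch of that idea. The Event fields follow the column names listed above, except that a weekdays tuple stands in for the separate mon..sun columns; only weekly recurrence is handled, and the occurrences function is an illustration rather than a complete rule engine.

```python
from dataclasses import dataclass
from datetime import date, time, timedelta


@dataclass
class Event:
    eventdate: date            # first (or only) date of the event
    starttime: time
    endtime: time
    isrecurring: bool = False
    recurrencetype: str = ""   # only "weekly" is handled in this sketch
    recurevery: int = 1        # repeat every N weeks
    weekdays: tuple = ()       # 0 = Monday ... 6 = Sunday


def occurrences(event, range_start, range_end):
    """Yield the dates on which `event` falls within the given range."""
    if not event.isrecurring:
        if range_start <= event.eventdate <= range_end:
            yield event.eventdate
        return
    if event.recurrencetype != "weekly":
        raise NotImplementedError("only weekly recurrence is sketched here")
    # The Monday of the first occurrence's week anchors the N-week cycle.
    anchor = event.eventdate - timedelta(days=event.eventdate.weekday())
    day = max(event.eventdate, range_start)
    while day <= range_end:
        weeks_since_anchor = (day - anchor).days // 7
        if (day.weekday() in event.weekdays
                and weeks_since_anchor % event.recurevery == 0):
            yield day
        day += timedelta(days=1)


every_monday = Event(date(2024, 1, 1), time(9), time(17),
                     isrecurring=True, recurrencetype="weekly",
                     recurevery=1, weekdays=(0,))
january = list(occurrences(every_monday, date(2024, 1, 1), date(2024, 1, 31)))
print(january)
```

Editing the stored row changes every future occurrence at once; a one-off exception (a blacked-out Monday, say) would still need its own record, as discussed in the question.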

Cache a complex calculation in Rails 3 model

I'm new to Ruby/Rails, so this is possibly (hopefully) a simple question that I just don't know the answer to.
I am implementing an accounting/billing system in Rails, and I'm trying to keep track of the running balance after each transaction in order to display it in the view as below:
Date    Description       Charges($)  Credits($)  Balance($)
Mar 2   Activity C        $4.00                   -$7.50
Feb 25  Payment for Jan               $8.00       -$3.50
Feb 23  Activity B        $1.50                   -$11.50
Feb 20  Activity A        $2.00                   -$10.00
Each transaction (also known as a line item) is stored in the database, with all the values above (Date, Description, Amount) except for the Balance. I can't store the balance for each transaction in the database because it may change if something happens to an earlier transaction (for example, a payment that was posted subsequently fails). So I need to calculate it on the fly for each line item, and the Balance for a line item depends on the value for the line item before it (i.e. Balance = Balance of Prev Line Item + Amount for this Line Item).
So here's my question. My current (inept) way of doing it is that in my LineItem model, I have a balance method which looks like :
def balance
  prev_balance = 0
  # Get the previous line item's balance if it exists.
  last_line_item = Billing::LineItem.get_last_line_item_for_a_ledger(self.issue_date, self.ledger_item_id)
  if last_line_item
    prev_balance = last_line_item.balance
    # .. some other stuff...
  end
  prev_balance + (-1 * net_amount) # net_amount is the amount for the current line item
end
This is super costly and my view takes forever to load since I'm calculating the prev line item's balance again and again and again. What's a better way to do this?
You're basically paying a price for not wanting to store the balance in each transaction. You could optimize your database with indices and use caches etc; but fundamentally you'll run into the problem that calculating a balance will take a long time, if you have lots of transactions.
Keep in mind that you'll continue to get new transactions, and your problem will thus get worse over time.
You could consider several design alternatives. First, like Douglas Lise mentioned, you could store the balance in each transaction. If an earlier dated transaction comes in, it means you may have to update several transactions since that date. However, this has an upper bound (depending on how "old" a transaction you want to allow), so it has a reasonable worst-case behavior.
Alternatively, you can do a reconciliation step. Every month you "close the books" on transactions older than X weeks. After reconciliation you store the Balance you calculated. In def balance you now use your existing logic, but also refer to "balance as of the previous reconciliation". This again, provides a reasonable and predictable worst-case scenario.
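As a language-neutral illustration of the reconciliation idea (written in Python rather than Rails, and not tied to the Billing::LineItem model), the sketch below starts from a stored "balance as of the previous reconciliation" and walks the later line items once, oldest first, instead of recursing per item. The opening balance of -8.00 is an assumption inferred from the table in the question, and running_balances is a hypothetical helper.

```python
from decimal import Decimal

# Line items after the last reconciliation, oldest first.
# net_amount is positive for charges and negative for credits, so
# balance = previous balance - net_amount, as in the balance method above.
line_items = [
    ("Feb 20", "Activity A", Decimal("2.00")),
    ("Feb 23", "Activity B", Decimal("1.50")),
    ("Feb 25", "Payment for Jan", Decimal("-8.00")),
    ("Mar 2", "Activity C", Decimal("4.00")),
]


def running_balances(items, opening_balance):
    """Return (description, balance) pairs computed in a single pass."""
    balance = opening_balance
    result = []
    for _, description, net_amount in items:
        balance -= net_amount
        result.append((description, balance))
    return result


# Assumed balance stored at the previous reconciliation.
reconciled_balance = Decimal("-8.00")
balances = running_balances(line_items, reconciled_balance)
for description, balance in balances:
    print(description, balance)
```

The work per page load is then proportional to the number of line items since the last reconciliation, not to the whole ledger history.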
