Simple scenario: I'd like to create a data warehouse that holds information about "issues" (cost, working time, etc.). An issue also has a status, which might change over time. So I'm creating a fact table called issueRealization describing each issue.
My question is: should I create an "issue" dimension, which would give me a one-to-one relationship between the dimension and the fact table? Or should I split the Issue dimension into smaller dimensions, such as status, etc.?
Issue status tracking is a good case to use an Accumulating Snapshot fact table, to track the changes in the status of an issue over time.
As an example, let's say this is an IT issue/bug/enhancement management system, with issues that only have 3 statuses: 'Created', 'In Progress', and 'Resolved'.
The issue fact table would look like this:
ID Number (Degenerate Dimension)
Issue description (Degenerate dimension. You can also create a 1-1 table for these if it's not often used in reporting)
Type ID (bug/enhancement/etc, this is a dimension key)
Assigned Developer ID (Dimension key)
Current Status ID (Status dimension key)
Date Created (DATE dimension)
Created Flag (1 = created, 0 = otherwise)
Date In Progress (DATE dimension)
In Progress Flag (1 = in progress, 0 = otherwise)
Date Resolved (DATE dimension)
Resolved Flag (1 = resolved, 0 = otherwise)
Created Datetime (measure)
InProgress Datetime (measure)
Resolved Datetime (measure)
Worktime Interval (measure)
Cost (measure)
The grain of this table is 1 row per issue ID number.
With this type of fact table, you update the same row each time the source system modifies an issue. Note how we create a field for each status type, as well as a datetime record to allow us to compute metrics such as "time between created and resolved status". In addition, I added an interval field to allow you to store "actual" work time, such as "hours" the developer put towards the fix. This could easily be an integer.
This table would then be able to answer any questions about an issue, and provide rollups to show "how many issues took longer than 1 week to resolve", etc.
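To make the "update the same row" idea concrete, here is a minimal Python sketch of an accumulating snapshot row. The class and function names (`IssueFactRow`, `apply_status`, `count_slow_issues`) are hypothetical, and only the three statuses from the example are handled; a real warehouse would do this with an UPDATE against the fact table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class IssueFactRow:
    """One row per issue; the same row is updated as the status changes."""
    issue_id: int
    current_status: str = "Created"
    created_at: Optional[datetime] = None
    in_progress_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None

    def apply_status(self, status: str, when: datetime) -> None:
        # Fill in the milestone column for this status instead of adding a row.
        if status == "Created":
            self.created_at = when
        elif status == "In Progress":
            self.in_progress_at = when
        elif status == "Resolved":
            self.resolved_at = when
        else:
            raise ValueError(f"unknown status: {status}")
        self.current_status = status

    def time_to_resolve(self) -> Optional[timedelta]:
        # Only defined once both milestones have been reached.
        if self.created_at is None or self.resolved_at is None:
            return None
        return self.resolved_at - self.created_at


def count_slow_issues(rows, threshold=timedelta(weeks=1)):
    """Rollup: how many issues took longer than the threshold to resolve."""
    slow = 0
    for row in rows:
        elapsed = row.time_to_resolve()
        if elapsed is not None and elapsed > threshold:
            slow += 1
    return slow
```

Because every milestone lives on the same row, "time between created and resolved" is a simple subtraction and needs no self-join.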
I have two input dimensions i.e. Day and Product_sold and I want to create a calculated field "Flag" in Tableau. Basically Flag will show "Yes" if the product was sold on all days, else No (see example attached), can you please help? I have tried multiple things but no use
You can create a fixed LoD calculation to count the distinct number of days in the data. Then use another fixed LoD (or possibly a table calc) to count the distinct days for each product. If the product COUNTD = dataset COUNTD, then it sold on every day.
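This is not Tableau syntax, but the comparison described above can be sketched in plain Python to check the logic. The function name `sold_every_day_flags` and the `(day, product)` row shape are assumptions for illustration only.

```python
from collections import defaultdict


def sold_every_day_flags(rows):
    """rows: iterable of (day, product_sold) pairs.

    Returns {product: "Yes" or "No"}, where "Yes" means the product
    appears on every distinct day present in the data.
    """
    all_days = set()
    days_by_product = defaultdict(set)
    for day, product in rows:
        all_days.add(day)                   # dataset-level distinct days
        days_by_product[product].add(day)   # per-product distinct days
    total_days = len(all_days)
    return {
        product: ("Yes" if len(days) == total_days else "No")
        for product, days in days_by_product.items()
    }
```

The two sets play the role of the two COUNTD calculations: one over the whole dataset, one per product.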
What have you tried so far? This looks like a simple attribution calc that could be put together as boolean eg:
Product_sold = 'computer'
Steve
I have an aggregation that looks at a sliding 30-day window (1 day period) of customer purchases, keyed by customer id, with the value being the purchase amount. I sum up the values by key, thus getting the aggregate purchase amount for each customer during the last 30 days. I store this number in a customer record in an external database.
My question is this: if a customer hasn't purchased anything in the last 30 days, how do I automatically reset the customer record to a default value, in this case zero? I'd prefer to keep all my logic in Dataflow and avoid doing too much work, since this will need to scale quite a bit. I'm basically looking for a way to automatically get a key-value for each key that was not in the current window but was in the last, and the value being a potentially configurable default.
Trying to answer my own question, but hoping for feedback as to whether this solution would scale:
I've thought about having a step after the initial window-and-sum. This transform would receive (customerId, purchaseSum) elements once a day, as the result of the 30-day window sum is made available. Since these elements are timestamped (with the timestamp of the most recent input element, I believe) I can re-window them. If I create a two-day window with a one-day period, I would then be able to group by key and process (customerId, [purchaseSumA, purchaseSumB]) for customers that had a purchase both in the last 30 days and in the last 31 days. In this case, I emit purchaseSumB. However, if there's only one element in the list, and the timestamp indicates that the purchase was made 31 days ago, I can assume that there were no purchases from the customer since, and I need to emit (customerId, 0). Does that make sense?
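Stripped of the windowing machinery, the goal is: for every key that was in the previous window's output but is missing from the current one, emit the default. A minimal plain-Python sketch of just that comparison follows; it is not Dataflow/Beam code, and the name `with_resets` is hypothetical.

```python
def with_resets(previous_sums, current_sums, default=0):
    """Merge two consecutive window outputs.

    previous_sums / current_sums: {customer_id: purchase_sum}.
    Keys present yesterday but absent today are emitted with `default`.
    """
    result = dict(current_sums)
    for customer_id in previous_sums:
        if customer_id not in current_sums:
            result[customer_id] = default
    return result
```

For example, `with_resets({"a": 10, "b": 5}, {"a": 12})` gives `{"a": 12, "b": 0}`: customer "b" dropped out of the 30-day window and is reset.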
Is it an option to slightly amend the database schema? I suppose now you have something like
`(customer_id int, purchases_last_month int)`
Instead how about
`(customer_id int, last_purchase datetime, purchases_last_month int)`
where this time last_purchase is the time of the last purchase made by this customer, and purchases_last_month refers to purchases made in the month leading up to that last purchase? Then in your DoFn that writes to the database, you'd be making a conditional update (merge/upsert) that updates both last_purchase and purchases_last_month with the values from the current window, but only if last_purchase is increasing. This way you can deal with windows being processed out-of-order or in parallel, at the cost of a slight increase in complexity in client queries (which you can address by adding a view on top of the table).
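As a rough illustration of the conditional update, here is a sketch using Python's built-in `sqlite3` module. The table name `customer_purchases` and the helper names are assumptions, timestamps are stored as ISO-8601 strings so that string comparison matches time order, and the `ON CONFLICT ... DO UPDATE ... WHERE` form needs SQLite 3.24 or later. Your actual database will have its own merge/upsert syntax.

```python
import sqlite3


def make_db():
    # In-memory database, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_purchases ("
        " customer_id INTEGER PRIMARY KEY,"
        " last_purchase TEXT NOT NULL,"
        " purchases_last_month INTEGER NOT NULL)"
    )
    return conn


def upsert_if_newer(conn, customer_id, last_purchase, purchases_last_month):
    # Insert a new customer, or overwrite the existing row only when the
    # incoming last_purchase is later than the stored one.
    conn.execute(
        """
        INSERT INTO customer_purchases
            (customer_id, last_purchase, purchases_last_month)
        VALUES (?, ?, ?)
        ON CONFLICT(customer_id) DO UPDATE SET
            last_purchase = excluded.last_purchase,
            purchases_last_month = excluded.purchases_last_month
        WHERE excluded.last_purchase > customer_purchases.last_purchase
        """,
        (customer_id, last_purchase, purchases_last_month),
    )
```

A window result that arrives late with an older last_purchase is silently ignored, which is what makes out-of-order processing safe here.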
I have been asked to model a star diagram.
I have 3 dimensions:
Date (day,month, year, week, quarter, ...)
place (500 distinct values)
Product (80k different products)
The main question is how many items (products) are stored at the end of a day in every place.
After some study time on dimensional modeling, I think I should implement a periodic snapshot table. However, reading through the Kimball docs, I noticed that a periodic snapshot demands an entry for every combination of the dimensions. This means I should add 40M rows every day (80k*500).
Knowing that the products are (real) slow movers and that many places store zero products during long periods, this sounds like an extreme overkill.
FYI the transactions in the source DB are 150k rows after three years.
So should I really add 40M rows every day, or could I just add the non-empty stores with their products specified? Also if for whatever reason one day all stores are empty, should I make an entry for that day (with dimensions N/A for store and product)?
You modeled correctly. It depends on the specifications, but normally you store only the products that are present in a location (you do not store zeroes), which could yield a number substantially lower than the maximum of 80k per place.
If you want to further reduce your numbers, you could store the last N days and then start to move data into a "cold" table. You store (say) the last 10 daily snapshots, then only monthly snapshots in the main "hot" fact table.
Do not rule out calculating the snapshot on the fly in the reporting system; depending on your environment it could be easy (in MDX or DAX, for example, it is). Mixed solutions are also possible (i.e., only the last month calculated on the fly).
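To show what "on the fly" could mean outside MDX/DAX, here is a small Python sketch that derives a sparse end-of-day snapshot from the transactions. It assumes each source transaction carries a signed quantity change, which the question does not state; the function name `snapshot_on` is hypothetical.

```python
from collections import defaultdict
from datetime import date


def snapshot_on(transactions, as_of):
    """Compute end-of-day stock from stock movements.

    transactions: iterable of (day, place, product, quantity_change),
                  where quantity_change is positive for stock added and
                  negative for stock removed (an assumption).
    as_of: the date whose end-of-day snapshot is wanted.
    Returns {(place, product): quantity}, omitting zero quantities.
    """
    stock = defaultdict(int)
    for day, place, product, change in transactions:
        if day <= as_of:
            stock[(place, product)] += change
    # Sparse result: combinations with nothing in stock are not stored.
    return {key: qty for key, qty in stock.items() if qty != 0}
```

With only about 150k source rows, a running total like this stays cheap, and an empty result for a day simply means every place was empty.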
I am writing a rails app that deals with product inventory. I would like to include the following features, and am struggling with developing an efficient algorithm:
View stock history (how many were in stock on each date)
Quantity removed from warehouse, and quantity added to warehouse over specific periods of time
Amount of time the product was out of stock in any given period
My questions are as follows:
What is the best way of tracking changes? In addition to my Products table, should I create another table called HistoricProductQuantities, and insert a new record each time there is a change in the quantity?
What number should I track? The historic stock quantity (i.e. 50 in stock on this day, 24 in stock on that day), or the CHANGE in stock quantity i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I track both in separate tables?
Thanks for your help.
First of all I recommend implementing Date Dimensions on your application, as it seems like you will be doing a lot of Time related calculations. Search on Google for date dimensions as it's beyond the scope of your questions. That said, I believe it will be of great benefit for your app to implement and use date dimensions.
As far as your direct questions go:
What is the best way of tracking changes? In addition to my Products table, should I create another table called HistoricProductQuantities, and insert a new record each time there is a change in the quantity?
Yes, you could do this. I would probably call it HistoricProductSnapshot and keep track of the product activity in there on a daily basis. With this information, as well as time dimensions, you could do calculations such as "how many of Product X did we have 5 days ago, or a month ago", etc.
What number should I track? The historic stock quantity (i.e. 50 in stock on this day, 24 in stock on that day), or the CHANGE in stock quantity i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I track both in separate tables?
I do not have experience writing inventory control software, but I believe that with the snapshot table I mentioned in the answer above you would only have to keep track of quantities per day. The change in product counts could then be calculated from your snapshot table. You could, for example, have a function that outputs the product amounts in a given time range as an array. Example: from March 1 to March 7 these were the stock amounts for Product Y: [45,40,39,27,22,45,44].
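For illustration, deriving the changes from the snapshot array could look like the following sketch (shown in Python rather than Ruby just to keep it short; `daily_changes` is a hypothetical name):

```python
def daily_changes(snapshots):
    """Day-over-day differences between consecutive daily stock snapshots."""
    return [later - earlier for earlier, later in zip(snapshots, snapshots[1:])]


stock = [45, 40, 39, 27, 22, 45, 44]   # March 1 to March 7, from the example
changes = daily_changes(stock)          # [-5, -1, -12, -5, 23, -1]
added = sum(c for c in changes if c > 0)       # 23
removed = -sum(c for c in changes if c < 0)    # 24
```

Note that a daily snapshot only gives the net change per day: if 10 units were added and 15 removed on the same day, the snapshot difference is just -5.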
Hope that helps. As I said, I am not a product inventory guy, but I have worked with point-of-sale systems, and the procedure above should give you a good enough start for what you are trying to do.
This gem could be useful for tracking changes in models: https://github.com/collectiveidea/audited
Keep the data raw. I would personally create a new data entry every day, recording how many items you have in stock per day. Or you can make the interval much shorter, such as every 12 hours.
For our particular use case:
We had a table called Days, which had a many to many relationship with products, and each "relationship" will have a value called quantity (to keep track of quantity of product per day). Additionally per relationship, we had another value for the relationship with transactions (a one to many relationship) that has the entries for the time of transaction and remaining stocks.
I would personally advise you to use the quantity of stock as the raw data, as it will enable you to derive things such as how many items were removed during a certain transaction, when the item went out of stock, and when it came back in stock, all from the data. When you have data on which you need to perform statistical calculations, it's best to store it as raw values (the quantity of the item).
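As one example of what stored quantities make possible, the out-of-stock days fall straight out of the raw values. A minimal sketch, assuming one recorded quantity per day (the name `out_of_stock_days` is hypothetical):

```python
from datetime import date


def out_of_stock_days(quantities_by_day):
    """quantities_by_day: {date: quantity in stock on that day}.

    Returns the sorted list of days on which the recorded quantity was zero.
    """
    return sorted(day for day, qty in quantities_by_day.items() if qty == 0)
```

The length of the returned list is the number of out-of-stock days in the period, at whatever resolution the quantities were recorded.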
I've just begun diving into data warehousing and I have one question that I just can't seem to figure out.
I have a business which has ten stores, each with a certain number of employees. In my data warehouse I have a dimension representing the store. The employee dimension is an SCD, with a column for start/end, and the store at which the employee is working.
My fact table is based on suggestions the employees give (anonymously) to the store managers. This table contains the suggestion type (cleanliness, salary issue, etc), the date it was submitted (foreign keyed to a Time dimension table), and the store at which it was submitted.
What I want to do is create a report showing the ratio of the number of suggestions to the number of employees in a given year. Because the number of employees changes periodically, I can't just do a simple query for the total number of employees.
Unfortunately I've searched the web quite a bit trying to find a solution but the majority of the examples are retail based sales, which is different from what I'm trying to do.
Any help would be appreciated. I do have the AdventureWorksDW installed on my machine so I can use that as a point of reference if anyone offers a suggestion using that.
Thanks in advance!
The slowly changing dimension should have a natural key that identifies the source of the row (otherwise how would it know what to compare to detect changes). This should be constant amongst all iterations of the dimension. You can get a count of employees by computing a distinct count of the natural key.
Edit: If your transaction table (suggestion) has a date on it, a distinct count of employees grouped by a computed function of the suggestion date (e.g. datepart (yy, s.SuggestionDate)) and the business unit should do it. You don't need to worry about the date on the employee dimension as the applicable row should join directly to the transaction table.
Add another fact table for the number of employees in each store for each month; you could use the max number for the month. Then average the monthly values over the year and use the result as the "number of employees in a year".
Load your new fact table at the end of each month. The new table would look like:
fact table: EmployeeCount
KeyEmployeeCount int -- surrogate key
KeyDate int -- FK to date dimension, point to last day of a month
KeyStore int -- FK to store dimension
NumberOfEmployees int -- (max) number of employees for the month in a given store
If you need a finer resolution, use "per week" or even "per day". The main idea is to average the NumberOfEmployees measure for a given store over the year.
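A rough sketch of the final calculation, in Python for brevity: average the monthly employee counts per store and year, then divide the suggestion count by that average. The input shapes and the name `suggestions_per_employee` are assumptions; in the warehouse this would be a query joining the two fact tables on date and store.

```python
from collections import defaultdict
from statistics import mean


def suggestions_per_employee(employee_counts, suggestion_counts):
    """Ratio of suggestions to average headcount, per (year, store).

    employee_counts: iterable of (year, month, store, number_of_employees),
                     one entry per store per month.
    suggestion_counts: {(year, store): number of suggestions}.
    """
    monthly = defaultdict(list)
    for year, _month, store, n in employee_counts:
        monthly[(year, store)].append(n)

    ratios = {}
    for key, counts in monthly.items():
        average_headcount = mean(counts)
        if average_headcount > 0:  # skip stores with no employees all year
            ratios[key] = suggestion_counts.get(key, 0) / average_headcount
    return ratios
```

For instance, a store with monthly counts of 10, 12 and 14 has an average headcount of 12, so 36 suggestions in that year give a ratio of 3.0.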