Dimension Model for New customers in current month - data-warehouse

I have started working on a dimensional model to count the new customers who visited a shop. I'm a little confused about how to identify the facts and dimensions for this purpose. Can someone help me with this?
As I understand it, I have identified Customer, Product, Invoice, Time, and Payment as dimensions (at level 0), but I'm not sure how to identify the fact. I know that facts are data that can be measured. The measure I want from the model is the count of new customers who visited in the current month.

Does each visit result in an invoice? How is "new customer" defined (e.g., by their first invoice, or by some time period after their first invoice)?
If each visit does produce an invoice, one option would be to create a "factless fact table" that captures each invoice event, plus a dimension that indicates whether the invoice is the first invoice for that customer (i.e., New Customer). You could then take a distinct count of customers where the New Customer dimension indicates that they are a new customer.
FactVisit(TimeKey,CustomerKey,InvoiceKey,ProductKey,PaymentKey,NewCustomerIndicatorKey)
DimNewCustomerIndicatorKey(NewCustomerIndicatorKey, ...) with values {"Y", "N"}
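A minimal sketch of this first option, using Python's built-in sqlite3 module. Several things here are assumptions made for illustration and are not from the answer: the indicator dimension is named DimNewCustomerIndicator, it has a NewCustomerFlag column, TimeKey is a YYYYMMDD integer, and the Product and Payment keys are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimNewCustomerIndicator (
    NewCustomerIndicatorKey INTEGER PRIMARY KEY,
    NewCustomerFlag TEXT NOT NULL          -- 'Y' or 'N' (assumed column name)
);
CREATE TABLE FactVisit (
    TimeKey INTEGER,                       -- assumed YYYYMMDD integer key
    CustomerKey INTEGER,
    InvoiceKey INTEGER,
    NewCustomerIndicatorKey INTEGER REFERENCES DimNewCustomerIndicator
);
INSERT INTO DimNewCustomerIndicator VALUES (1, 'Y'), (2, 'N');
INSERT INTO FactVisit VALUES
    (20240105, 10, 1001, 1),   -- customer 10, first invoice
    (20240120, 10, 1002, 2),
    (20240203, 11, 1003, 1),   -- customer 11, first invoice
    (20240210, 10, 1004, 2),
    (20240215, 12, 1005, 1);   -- customer 12, first invoice
""")

def new_customers_in_month(conn, year_month):
    """Distinct count of customers flagged as new in the given YYYYMM month."""
    row = conn.execute(
        """
        SELECT COUNT(DISTINCT f.CustomerKey)
        FROM FactVisit f
        JOIN DimNewCustomerIndicator d
          ON d.NewCustomerIndicatorKey = f.NewCustomerIndicatorKey
        WHERE d.NewCustomerFlag = 'Y'
          AND f.TimeKey / 100 = ?          -- integer division drops the day
        """,
        (year_month,),
    ).fetchone()
    return row[0]

print(new_customers_in_month(conn, 202402))
```

In a full model the month filter would normally go through the Time dimension rather than arithmetic on the key; the shortcut above only keeps the sketch small.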
Another possibility would be a separate fact table that captures a row the first time a customer is seen.
One additional option would be to include an attribute in the Customer dimension that holds the date that the customer was first seen.
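As a rough illustration of this last option, the count can come straight from the Customer dimension once it carries a first-seen date. The attribute name FirstSeenDate and the sample rows are assumptions, not part of the answer.

```python
from datetime import date

# Customer dimension rows with an assumed FirstSeenDate attribute.
dim_customer = [
    {"CustomerKey": 10, "FirstSeenDate": date(2024, 1, 5)},
    {"CustomerKey": 11, "FirstSeenDate": date(2024, 2, 3)},
    {"CustomerKey": 12, "FirstSeenDate": date(2024, 2, 15)},
]

def count_new_customers(customers, year, month):
    """Count customers whose first visit falls in the given month."""
    return sum(
        1
        for c in customers
        if c["FirstSeenDate"].year == year and c["FirstSeenDate"].month == month
    )

print(count_new_customers(dim_customer, 2024, 2))
```

The trade-off is that this answers "how many new customers" without touching a fact table, but it cannot slice by product or payment the way the fact-based options can.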
A lot depends on what/if any OLAP front-end you're using and what tool you're using to report the results.

Related

Unit Price and Discounts - Fact or Dimension Table

I'm working on a datamart for our sales and marketing departments, and I've come across a modeling challenge. Our ERP stores pricing data in a few different ways:
List pricing for each item
A discount percentage from list pricing for a product line, either for groups of customers or for a specific account
A custom price for an item, either for groups of customers or for a specific account
The Pricing department primarily uses this data operationally, not analytically. For example, they generate reports for customers ("What special pricing / discount %s do I have?") and identify which items / item groups need to be changed when they engage in a new pricing strategy.
Pricing changes happen somewhat regularly on a small scale, usually on a customer-by-customer or item-by-item basis. Infrequently, there are large-scale adjustments to list pricing and group pricing (discounts and individual items) in addition to the customer-level discounts.
My head has been in creating one or more fact tables to represent this process. Unfortunately, there's no pre-existing business key for pricing. There's also no specific "transaction date," since the ERP doesn't (accurately) maintain records of when pricing is changed. Essentially, a "pricing event" is going to be a combination of:
Effective date
End date
Item OR product line
(Not required for list price) customer or customer group
A price amount OR discount percentage
A single fact table seems problematic in that I'm going to have to deal with a lot of invalid combinations of dimensions and facts. First, a record will never have both a non-NULL price amount and a non-NULL discount percentage; pricing events are either-or. Second, only certain combinations of dimensions are valid for each fact. For example, a discount percentage will only ever have a product line, not an individual item.
Does it make sense to model pricing as a fact table in the first place? If so, how many tables should I be considering? My intuition is to use at least two, one for the percentages and one for the price amounts, but this still leaves a problem where each record will either have a valid customer group OR a valid customer (or neither, for list prices), since we need to maintain customer-specific pricing separate from any group pricing that customer might have.
You may need to keep them both as attributes and as facts.
The price a certain item was sold for is a fact. When you multiply it by the quantity sold it's actually an additive measure. So, keep it in the fact table. Total discount applied is also additive, I'd keep it. You can later query "how much was discounted in 2019 per customer", which would be much harder to achieve without those facts.
But if you also need to answer questions like "what discount is customer X on", then you should also keep the discount as an attribute of the customer dimension and treat it as a Type 2 slowly changing dimension, so that discount history is kept. If you know when a given discount was applied, great; if not, take the first sale as the start date and you won't be too far off.
The list price could also be kept as an attribute of the product or product line dimension, but only if it doesn't change too often; and if most customers get discounts anyway, that would be of limited use.
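To make the Type 2 suggestion concrete, here is a small sketch of a customer dimension that keeps discount history and can answer "what discount was customer X on at date D". The table layout, column names, and sample data are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCustomer (
    CustomerKey   INTEGER PRIMARY KEY,  -- surrogate key, one per version
    CustomerId    TEXT NOT NULL,        -- business key from the ERP
    DiscountPct   REAL NOT NULL,
    EffectiveDate TEXT NOT NULL,        -- ISO dates compare correctly as text
    EndDate       TEXT                  -- NULL marks the current version
);
INSERT INTO DimCustomer VALUES
    (1, 'C-100', 5.0, '2019-01-01', '2019-06-30'),
    (2, 'C-100', 7.5, '2019-07-01', NULL);
""")

def discount_on(conn, customer_id, as_of):
    """Return the discount percentage in effect for a customer on a date."""
    row = conn.execute(
        """
        SELECT DiscountPct
        FROM DimCustomer
        WHERE CustomerId = ?
          AND EffectiveDate <= ?
          AND (EndDate IS NULL OR EndDate >= ?)
        """,
        (customer_id, as_of, as_of),
    ).fetchone()
    return row[0] if row else None

print(discount_on(conn, "C-100", "2019-03-15"))
print(discount_on(conn, "C-100", "2019-08-01"))
```

Sales fact rows would reference the CustomerKey of the version in effect at sale time, so historical reports keep the discount that actually applied.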

HR Data Mart Design Advice

I am working on a design for an HR data mart using the Kimball approach outlined in 'The Data Warehouse Toolkit'.
As per the Kimball design, I was planning to have a time-stamped, slowly-changing dimension to track employee profile changes (to support point-in-time analysis of employee state) and a head-count periodic snapshot fact table to support measures of new hires, leavers, leave taken, salary paid etc.
The problem I've encountered is that, in some cases, our employees can be assigned to multiple roles/jobs and each one needs to be tracked separately (i.e. the grain of my facts has to be at job-level, not employee level).
How might the Kimball design be adapted to fit a scenario where employee and role/job form a hierarchy like this? Ideally, I want to avoid duplicating employee profile data (address, demographics etc) for each role/job an employee is assigned to, but does this mean I need to snow-flake the dimension?
Options I've been considering include the below - I'd be interested in any thoughts or suggestions the community has on this so all input is welcome!
1) (see attached, design 1) A snowflake-style approach with an employee table which has a 1-to-Many link role table, which, in turn, has a 1-to-many link with the fact table. The advantage here is a clean employee dimension but I don't want to introduce unnecessary complexity. Is there any reason why I shouldn't link both dimensions directly to the fact table? The snowflake designs I've seen don't seem to do this.
2) (see attached, design 2) A combined Employee/Role dimension where each employee has a record for each assigned role but only one of them is flagged as 'Primary Role'. Point-in-time queries on the dimension can be performed by constraining on the 'Primary Role' flag.
Anything that occurred is an event and can be a fact. When you look at relationships between data, you also need to ask whether the data value describes the entity (dimension) or something that happened to or with the entity (fact). Anything can be a dimension or a fact, and sometimes both.
A job describes an event that happened to the employee. You should have a fact employeejob that relates to the Dim employee and Dim job (as well as your date dimensions). This will then allow you to break down absences by employee and job. Your dim job would really just be job title, pay grades, etc. The fact would contain effective dates. Research factless fact tables.
Note that your vacancy reference would be part of a separate fact (when/where did you post it, how many applicants are all measurable facts about the vacancy). This may also be an example of a degenerate dimension.
I'm not fond of your monthly fact. I think that should just be some calculated measures built on fact absence and fact employeejob. When those events are put up against your dimensions, you can break them down by date, job type, manager, etc.
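A minimal sketch of the employee-job fact described in this answer, with head count derived from it instead of stored in a monthly snapshot. Table and column names and the sample rows are assumptions; the point is only that employee profile data lives once in the employee dimension while each job assignment is its own fact row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimEmployee (EmployeeKey INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE DimJob (JobKey INTEGER PRIMARY KEY, JobTitle TEXT, PayGrade TEXT);
CREATE TABLE FactEmployeeJob (          -- factless: one row per assignment
    EmployeeKey   INTEGER REFERENCES DimEmployee,
    JobKey        INTEGER REFERENCES DimJob,
    EffectiveDate TEXT NOT NULL,
    EndDate       TEXT                  -- NULL while the assignment is open
);
INSERT INTO DimEmployee VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO DimJob VALUES (1, 'Nurse', 'G5'), (2, 'Trainer', 'G6');
INSERT INTO FactEmployeeJob VALUES
    (1, 1, '2023-01-01', NULL),          -- Ann holds two jobs at once
    (1, 2, '2023-06-01', NULL),
    (2, 1, '2023-03-01', '2023-09-30');
""")

def headcount_by_job(conn, as_of):
    """Head count per job title for assignments active on the given date."""
    rows = conn.execute(
        """
        SELECT j.JobTitle, COUNT(DISTINCT f.EmployeeKey)
        FROM FactEmployeeJob f
        JOIN DimJob j ON j.JobKey = f.JobKey
        WHERE f.EffectiveDate <= ?
          AND (f.EndDate IS NULL OR f.EndDate >= ?)
        GROUP BY j.JobTitle
        ORDER BY j.JobTitle
        """,
        (as_of, as_of),
    ).fetchall()
    return dict(rows)

print(headcount_by_job(conn, "2023-07-01"))
```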

FactLoanVolume - One or Many Fact Tables

I am designing a Fact table to report on loan volume. The grain is one row per loan transaction. A loan has a few major milestones that we report on: In order of sequence, these are Lock Volume, Loan Funding Volume and Loan Sales Volume.
I have Lock Date, Loan Funding Date and Loan Sale Date as FK (there are other dimensions in addition to these) in the Fact table to role playing dimensions off my DimDate table.
My question is, should I create separate Fact Tables to report volume for each major milestone or should I keep all of this in one Fact Table and use a "far in the future" date (e.g., 12/31/2099) for a milestone on a loan that has not been met?
I have read the Kimball books but I didn't find a definitive answer(if one even exists).
Thanks
You may profit from an immutable design by making the grain finer, down to the milestone level.
This gives you columns
transaction_id
milestone_type
milestone_date
in your fact table. The current milestone of a transaction is the milestone from the last (most recent) record.
One advantage is that you can add new milestone types in the future, but the main gain is that you never update your fact table; you use inserts only.
You can safely roll back a bad ETL load simply by deleting its records, which is much more complicated when updates are involved.
You may also implement more complicated state diagrams, e.g. in case when some milestone is revoked and the transaction falls back in the previous state.
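The insert-only idea can be sketched in a few lines of plain Python. The in-memory list stands in for the fact table, and the batch_id column is a hypothetical addition (not mentioned above) that makes "delete the records of a bad load" easy to express.

```python
from datetime import date

# Append-only list standing in for the milestone-grain fact table.
fact_loan_milestone = []

def record_milestone(transaction_id, milestone_type, milestone_date, batch_id):
    """Insert one milestone row; existing rows are never updated."""
    fact_loan_milestone.append({
        "transaction_id": transaction_id,
        "milestone_type": milestone_type,
        "milestone_date": milestone_date,
        "batch_id": batch_id,  # hypothetical ETL load identifier
    })

def current_milestone(transaction_id):
    """The current milestone is the one on the most recent row."""
    rows = [r for r in fact_loan_milestone
            if r["transaction_id"] == transaction_id]
    if not rows:
        return None
    return max(rows, key=lambda r: r["milestone_date"])["milestone_type"]

def rollback_batch(batch_id):
    """Undo a bad ETL load by deleting only the rows it inserted."""
    fact_loan_milestone[:] = [r for r in fact_loan_milestone
                              if r["batch_id"] != batch_id]

record_milestone(1, "lock", date(2024, 1, 10), batch_id=1)
record_milestone(1, "funding", date(2024, 2, 1), batch_id=2)
before = current_milestone(1)   # milestone after both loads
rollback_batch(2)
after = current_milestone(1)    # milestone once load 2 is rolled back
print(before, after)
```

Because no row is ever overwritten, rolling back load 2 restores the earlier state without needing any "previous value" bookkeeping.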
Whether you use one fact table or several depends on whether your milestones are homogeneous. If the milestones have distinct attributes, you may get a cleaner design using dedicated fact tables, but the queries get more complicated.
You would be better off with only one fact table.
The following question and its discussion answer the general question of "one or multiple fact tables?" fairly well, though perhaps not how to deal with your specific problem of dates.

Data warehouse multivalued attributes

Disclaimer: I have never created a data warehouse before. I have read several chapters of Kimball's Data Warehouse Toolkit.
Background: Plant (factory) management team needs to be able to slice and dice production information in various ways, and we want a consistent reporting format across manufacturing plants in our division. Through business analysis, we have concluded that the fact grain is 1 row per process completed. A completed process can either mean "machine" or "assemble." I am calling this the "Production fact".
The questions that the business needs to answer are the following:
Who was working when the process completed?
What was the cycle time of the process?
What is the serial number of the part that was being produced by the process?
My schema includes the following first-level dimensions. I do not have any dimensions beyond the first level, but there are some cross relations between the plant dimension and the part type, shift, and process dimensions.
Part Type (Attributes: Surrogate Key, Part Number, Model, Variant, Part Name)
Plant (Attributes: Surrogate Key, Plant Name, Plant Acronym)
Shift (Attributes: Surrogate Key, Plant Key, Start Hour24, Start Minute, End Hour24, End Minute)
Process (Attributes: Surrogate Key, Plant Key, Production line, Process Group, Process Name, Machine Type)
Date (typical date dimension attributes)
Time of Day (typical time of day dimension attributes)
The non dimensional facts are:
Part serial Number (instances of a part type)
Cycle time
Employee ID(s) *MULTI-VALUED*
Problem
My problem is that more than one employee may have been working the process at the time. So, I am wondering if I need to change my model and how to best represent the employee in the model. We are not trying to house employee information, just their company employee ID. I've considered the following options:
Allow for multiple employee IDs in the employee column of the fact table (e.g. comma separated). Disadvantage: the number of employees working on the process is a variable number. Would I need to create the field big enough to accommodate up to X number of employees? What should X be?
Create a record for each production fact per employee. This would mean more than one record for the same fact; that would be bad. :)
Create an employee dimension and a "Process Employees" bridge table between the employee dimension table and the fact table. Problem: the employees working on the process at the time are not represented in the fact table.
Create an Employee dimension, a Process Employees Group table, and a bridge table between the Process Employees Group table and the Employee dimension table. The employee group and bridge tables would need to be a) pre-populated with all possible employee combinations (not practical on any level, since we have thousands of employees) or b) populated on the fly during ETL. Option 4b would require a check to see whether a given group of employees already exists for each process; this might be taxing on the DBMS/ETL system if the source records are batched more frequently than a few times per day (e.g., 10 times per hour for near real-time reporting).
My Question(s)
I'm thinking that option 3 is the most viable option, but I have some reservations. Are there potential watch-outs? Are there other alternatives that I should consider? Is it okay to take the employees who worked on the process out of the fact table?
Thank you for any advice.
There is a concept called slowly changing dimensions.
These are modeled as dimensions; in this case it would be a table that I will call PartEmployee.
The structure of this table will be
PartId - PK
EmployeeId - PK
EmployeeStartDate - PK
EmployeeEndDate
The End Date will be null if the employee is still working on the part. When a new employee starts working on the part, the previous employee record for the part will be closed and a new record created for the part with the new employee.
Add an employee column to the PartFact table:
EmployeeId
This column will hold the current employee. The fact record will be updated every time a new employee starts working on the part.
This will give you the historical perspective of which employees worked on the part and also the information of the employee who worked on the part last.
Hope this helps...
I've had time to think about my options, and none of the 4 options listed in my original post are correct. The problem discussed seems to be a classic "coverage" problem; the business needs to know which employees were working which processes at a given time. If we have that information, we will know who was working on a particular part when a given process completed. This would best be represented as a factless fact table between an employee dimension and the production process dimension.
This approach also helps me save space and improve querying power, because a single employee "coverage" fact will span multiple process production facts.
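A rough sketch of how the coverage table could be queried. All table and column names, the half-open time interval, and the sample rows are assumptions for illustration; the idea is that the production fact stays at one row per completed process and the employees are found by joining on the process and the completion time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FactEmployeeCoverage (     -- factless: who covered which process
    EmployeeId TEXT NOT NULL,
    ProcessKey INTEGER NOT NULL,
    StartTime  TEXT NOT NULL,           -- ISO timestamps compare as text
    EndTime    TEXT NOT NULL
);
CREATE TABLE FactProduction (           -- one row per completed process
    ProcessKey       INTEGER NOT NULL,
    PartSerialNumber TEXT NOT NULL,
    CompletedAt      TEXT NOT NULL,
    CycleTimeSeconds REAL
);
INSERT INTO FactEmployeeCoverage VALUES
    ('E1', 1, '2024-03-01 06:00', '2024-03-01 14:00'),
    ('E2', 1, '2024-03-01 10:00', '2024-03-01 18:00'),
    ('E3', 2, '2024-03-01 06:00', '2024-03-01 14:00');
INSERT INTO FactProduction VALUES
    (1, 'SN-001', '2024-03-01 08:30', 95.0),
    (1, 'SN-002', '2024-03-01 11:15', 90.0);
""")

def employees_for_part(conn, serial_number):
    """Employees covering the process at the moment the part completed."""
    rows = conn.execute(
        """
        SELECT c.EmployeeId
        FROM FactProduction p
        JOIN FactEmployeeCoverage c
          ON c.ProcessKey = p.ProcessKey
         AND c.StartTime <= p.CompletedAt
         AND p.CompletedAt < c.EndTime
        WHERE p.PartSerialNumber = ?
        ORDER BY c.EmployeeId
        """,
        (serial_number,),
    ).fetchall()
    return [r[0] for r in rows]

print(employees_for_part(conn, "SN-002"))
```

Two coverage rows here serve every part completed on process 1 during those hours, which is where the space saving comes from.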

Entity relationship diagram advice needed

I am creating a database based on an ERD I have designed according to some business rules, where I am allowed to make assumptions and implement them for the future.
Business rule:
Entity relationship diagram
Based on the business rules, the customer is invoiced for the holiday, hence the relationship would be 1..1. However, I have been left to assume that the customer may receive one or more invoices for the same reservation, for example if the customer makes changes to the reservation or a reminder invoice is raised.
If I leave the relationship as 1..1, then I might as well get rid of the invoice table and use the reservation as the invoice, since they use the same attributes, and link it to the payment_method.
I don't know which way is best, first time doing databases...
Please advise
It sounds to me like you should make it a one-to-many relationship between the reservation and the invoice. You say that a customer may receive multiple invoices for a single reservation, such as when the reservation changes. That makes me think it should be one reservation to one or more invoices.
What I might include on the invoice table would be a field telling if it is the latest invoice, or a nullable field pointing to the next invoice. If an invoice becomes invalid/outdated/superseded, then a new invoice is created and all previous invoices then have their superseded field filled in to point to the most current invoice. That way you can still keep a trail of previous invoices as well as the current one.
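The superseded-pointer idea could look roughly like this. It is only a sketch: the field name superseded_by, the in-memory dictionary standing in for the invoice table, and the sample amounts are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Invoice:
    invoice_id: int
    reservation_id: int
    amount: float
    superseded_by: Optional[int] = None  # None means this is the latest invoice

invoices = {}  # invoice_id -> Invoice, standing in for the invoice table

def issue_invoice(invoice_id, reservation_id, amount):
    """Create a new invoice and point all earlier ones for the reservation at it."""
    for inv in invoices.values():
        if inv.reservation_id == reservation_id:
            inv.superseded_by = invoice_id
    invoices[invoice_id] = Invoice(invoice_id, reservation_id, amount)

def current_invoice(reservation_id):
    """Return the invoice for the reservation that has not been superseded."""
    for inv in invoices.values():
        if inv.reservation_id == reservation_id and inv.superseded_by is None:
            return inv
    return None

issue_invoice(1, reservation_id=500, amount=1200.0)
issue_invoice(2, reservation_id=500, amount=1350.0)  # reservation was changed
print(current_invoice(500).invoice_id, invoices[1].superseded_by)
```

The older invoice stays in the table as an audit trail, while queries for the amount owed only look at the row whose pointer is empty.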

Resources