Surrogate keys in star schema hierarchy dimension

Is it necessary to have surrogate keys for each hierarchy level above the lowest level in a dimension table?
Row | City_Key | City_Name | State
1 | 1234 | Chicago | Illinois
2 | 3245 | Dallas | Texas
3 | 4563 | Houston | Texas
4 | 3457 | Seattle | Washington
vs
Row | City_Key | City_Name | State_Key | State
1 | 1234 | Chicago | 535 | Illinois
2 | 3245 | Dallas | 659 | Texas
3 | 4563 | Houston | 659 | Texas
4 | 3457 | Seattle | 912 | Washington
If so, how would I go about generating surrogate keys for levels in the hierarchy with SQL if it would not suffice to have an auto-incrementing key which would change per row like the lowest level key?
Would it be better to use a snowflake schema with normalized hierarchy dimensions or perhaps create/manage a denormalized hierarchy dimension table through joining a normalized hierarchy?

Is it necessary to have surrogate keys for each hierarchy level above the lowest level in a dimension table?
No. In a star schema there is no need, as the attribute hierarchies are modeled as non-key columns of a single dimension table.
In a snowflake design, where each level of the hierarchy is modeled as a separate table, such keys would of course be required.
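If you do want a key per hierarchy level (for example, to build the denormalized dimension by joining a normalized hierarchy), one possible approach is to load the distinct higher-level values into their own table with an auto-generated key and join back. The following is only a sketch using Python's built-in sqlite3 module. The table names (stg_city, dim_state, dim_city) are hypothetical, and the generated key values are whatever SQLite assigns, not the 535/659/912 shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging rows taken from the question (table names are illustrative).
    CREATE TABLE stg_city (city_name TEXT, state TEXT);
    INSERT INTO stg_city VALUES
        ('Chicago', 'Illinois'), ('Dallas', 'Texas'),
        ('Houston', 'Texas'), ('Seattle', 'Washington');

    -- One row per distinct state; the INTEGER PRIMARY KEY is auto-assigned.
    CREATE TABLE dim_state (state_key INTEGER PRIMARY KEY, state TEXT UNIQUE);
    INSERT INTO dim_state (state)
        SELECT DISTINCT state FROM stg_city ORDER BY state;

    -- Denormalized dimension: one city key per row, state key shared per state.
    CREATE TABLE dim_city (
        city_key  INTEGER PRIMARY KEY,
        city_name TEXT,
        state_key INTEGER,
        state     TEXT
    );
    INSERT INTO dim_city (city_name, state_key, state)
        SELECT s.city_name, d.state_key, d.state
        FROM stg_city AS s
        JOIN dim_state AS d ON d.state = s.state
        ORDER BY s.city_name;
""")

rows = conn.execute(
    "SELECT city_name, state_key, state FROM dim_city ORDER BY city_name"
).fetchall()
for row in rows:
    print(row)
```

Dallas and Houston end up with the same state_key because the key is generated once per distinct state rather than once per row.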

Related

Can observations in a fact table also be dimensions?

My understanding is that a fact table uses keys, which are dimensions that ought to have their own dimension table, to identify observations and assign them values. Can these values themselves be dimensions? Or does that violate some principle of a star schema?
For example, is this a valid fact table design?
Start Time | Stop Time | Employee ID | Performance
01 | 60 | 0100 | Grade 3
01 | 20 | 0200 | Grade 2
20 | 60 | 0200 | Grade 3
My dimensions that I use to identify facts are the first three columns, with the final column being an observation. However, if I have more information about what each Performance means, does that mean that there needs to be a Performance dimension table? Or, because Performance is an observation rather than a dimension, does this data need to be in the fact table itself?
In a fact table there are normally 3 types of column:
measures: anything that can be aggregated
dimension keys: key to a record in a dimension table
degenerate dimensions: attributes that do not naturally sit in a dimension (often because they would be the only attribute in the dimension)
It is also possible for an attribute to be both a measure in a fact table and an attribute in a dimension. For example, the price of a product could be both a measure in a fact table and an attribute in the product dimension.
Does this help?
Update
Say you wanted to know the average price of the products you have sold: in this case product price is a measure and lives in a sales fact table; that fact table would almost certainly have an FK to your Product Dimension - so you could filter on product attributes e.g. average price of product for products whose category = "Food".
You might also want to filter a query based on product price: in this case product price is an attribute in your product dimension (which would probably be an SCD2 dimension to cater for price changes). For example, you might want to query your stock-level fact table (which doesn't hold product prices as measures but does have an FK to the product dimension) for all products whose price is between £10 and £20.
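The two uses of price described above can be sketched as follows, using Python's built-in sqlite3 module. All table names, column names and sample rows here are made up for illustration, and the product dimension is kept as a plain (non-SCD2) table to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name TEXT,
        category TEXT,
        price REAL              -- price as a dimension attribute
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        sale_price REAL         -- price as a measure
    );
    CREATE TABLE fact_stock (
        product_key INTEGER REFERENCES dim_product(product_key),
        units_in_stock INTEGER  -- no price measure in this fact
    );
    -- Hypothetical sample data.
    INSERT INTO dim_product VALUES
        (1, 'Bread', 'Food', 2.0),
        (2, 'Cheese', 'Food', 12.0),
        (3, 'Kettle', 'Home', 15.0);
    INSERT INTO fact_sales VALUES (1, 2.0), (2, 12.0), (2, 10.0), (3, 15.0);
    INSERT INTO fact_stock VALUES (1, 40), (2, 5), (3, 7);
""")

# Price as a measure: average sale price, filtered by a product attribute.
avg_food_price = conn.execute("""
    SELECT AVG(f.sale_price)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    WHERE p.category = 'Food'
""").fetchone()[0]

# Price as an attribute: filter a fact table that holds no price measure.
stock_10_to_20 = conn.execute("""
    SELECT p.name, s.units_in_stock
    FROM fact_stock AS s
    JOIN dim_product AS p ON p.product_key = s.product_key
    WHERE p.price BETWEEN 10 AND 20
    ORDER BY p.name
""").fetchall()

print(avg_food_price)
print(stock_10_to_20)
```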

Relations between slowly changing dimensions in a data warehouse

I’m designing a data warehouse and am struggling to work out how I should be modelling this scenario. I’ve got Users stored in a Slowly Changing Dimension Type 2 table along these lines:
UserKey | UserID | Language | EffectiveDate | ExpiryDate | IsCurrent
1 | 1001 | EN | 2021-01-01 | 9999-12-31 | Y
2 | 1002 | EN | 2021-07-31 | 2022-01-06 | N
3 | 1002 | FR | 2022-01-06 | 9999-12-31 | Y
And a Login fact table like:
LoginKey | UserKey | LoginTime
12345 | 2 | 2021-12-25 15:00
12399 | 3 | 2022-01-31 18:00
Thereby allowing us to report on logins by date by user language setting at the time, etc.
Now I have to consider that each user may have one, none, or many concurrent subscriptions, which I was thinking of modelling in a Type 1 SCD thus:
SubsKey | SubsID | SubsType | UserKey | StartDate | EndDate
55501 | SBP501 | Premium | 2 | 2021-08-01 | 2022-08-01
55502 | SBB123 | Bonus | 3 | 2022-01-31 | 2023-01-31
Is it right for one dimension table to reference the surrogate row key of another like this, or should it rather contain the UserID natural key? It seems unwieldy for the Subs table to have different UserKeys for the two concurrent Subscriptions for the same user like this. Or perhaps, when the third row was added to the Type 2 User table, should all the existing rows in Subs with UserKey=2 have been updated to UserKey=3?
The whole thing doesn't seem to fit comfortably into the classic snowflake pattern, which usually has the one-to-many relationship pointing the other way, as might be the case were Language to be a separate dimension table say, with a one-to-many relation on User.
Edit
I'm wrestling not only with the one-to-many example described (one user has many subscriptions) but also with many-to-one relations between SCD Type 2 tables. For example, if the user's language were stored in an SCD Type 2 table, should the User dimension reference the Language ID or the LanguageKey of the Language table's current row?
A subscription is a fact and so should be stored in a fact table - though you might also have a subscription dimension that holds attributes of a subscription such as its name.
You relate dimensions through fact tables, so your subscription fact would have FKs to Subscription, User, Date etc dimensions.
Relating dimensions directly to each other is called snowflaking and is generally a bad design.
BTW for an SCD2 table, having the expiry date of one row the same as the effective date of the next row is not a good design. In your example, you would need business logic to define which row was active on 2022-01-06, whereas if a row expires on 2022-01-06 and the next row starts on 2022-01-07 there can be no confusion.
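The ambiguity is easy to see when you look up the row that was active on a given date. Below is a rough sketch using Python's built-in sqlite3 module, assuming both EffectiveDate and ExpiryDate are treated as inclusive and the lookup uses BETWEEN. The table names dim_user_overlap and dim_user_clean and the helper active_keys are hypothetical; the second table simply moves the new row's EffectiveDate to 2022-01-07 as suggested above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- As in the question: row 2 expires on the day row 3 becomes effective.
    CREATE TABLE dim_user_overlap (
        UserKey INTEGER PRIMARY KEY, UserID INTEGER, Language TEXT,
        EffectiveDate TEXT, ExpiryDate TEXT
    );
    INSERT INTO dim_user_overlap VALUES
        (2, 1002, 'EN', '2021-07-31', '2022-01-06'),
        (3, 1002, 'FR', '2022-01-06', '9999-12-31');

    -- Adjusted version: the next row starts the day after the previous one ends.
    CREATE TABLE dim_user_clean (
        UserKey INTEGER PRIMARY KEY, UserID INTEGER, Language TEXT,
        EffectiveDate TEXT, ExpiryDate TEXT
    );
    INSERT INTO dim_user_clean VALUES
        (2, 1002, 'EN', '2021-07-31', '2022-01-06'),
        (3, 1002, 'FR', '2022-01-07', '9999-12-31');
""")


def active_keys(table, user_id, on_date):
    """Return the UserKeys whose inclusive date range covers on_date."""
    # The table name is a fixed, trusted literal here; values are bound parameters.
    sql = (
        f"SELECT UserKey FROM {table} "
        "WHERE UserID = ? AND ? BETWEEN EffectiveDate AND ExpiryDate "
        "ORDER BY UserKey"
    )
    return [key for (key,) in conn.execute(sql, (user_id, on_date))]


overlap = active_keys("dim_user_overlap", 1002, "2022-01-06")
clean = active_keys("dim_user_clean", 1002, "2022-01-06")
print(overlap)  # two candidate rows for the same day
print(clean)    # exactly one row
```

With the overlapping ranges, a login on 2022-01-06 matches both UserKey 2 and UserKey 3, so extra business logic is needed to pick one; with non-overlapping ranges the lookup returns a single row.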
Based on your examples, the last table looks closer to SCD Type 4 than Type 1.
I agree that subscriptions could be a fact table with an accompanying dimension table.
Perhaps SCD Type 2 would be the best option for the subscription dimension table, with an added flag column marking the current/active subscription alongside its effective date.

Integrating my Oracle database into my OWL ontology with Ontop

I'm struggling to integrate my Oracle database into my OWL ontology, specifically when it comes to keys.
Suppose I have two tables:
Customer
Name | Adress | IDNumber
First | row | 1234
Second | row | 5678
Billing
BillingNo | Date | IDNumber
987 | row | 1234
986 | row | 1234
654 | row | 5678
and I want to map the tables with my ontology, do I have to set up my IDNumbers as the same Data property with two Domains (Customer, Billing)? Or would it be a better solution to set up two IDNumbers with a different prefix (cust_IDNumber, bill_IDNumber) and combine them with "equivalent to"?
Is there a difference between both methods (performance, validity etc.)?
Could anyone please suggest a good book or a good tutorial?

How to model a dimension table that links to several facts with different levels of grain?

I have a fact that stores the client's address. The problem is that the client can choose to enter information at state level, county level, or street level. In the operational database, there is one table for streets, linked to another table for counties, linked to another table for states. The client table has one column each for state, county, and street, each containing an ID (so it can link to the higher object in the hierarchy).
How can I model the relationship between the fact and the dimension in a star-schema?
So I created one Location dimension with all states, all counties, and all streets. The table looks like this:
DIM_ID | Level | Street columns | County columns | State columns
1 | Street | Bolsa | Westminton | California
2 | County | Westminton [county] | Westminton | California
3 | State | [State of] California | [State of] California | California
If the client discloses the street, the fact record links to row 1; if the client discloses only the county, it links to row 2; if the client discloses only the state, it links to row 3.
What do you think of that approach?
I would probably model these levels separately, as they are being treated as separate, i.e.:
dim.Street
dim.County
dim.State
As for relating these to clients, I'd go for bridge tables, e.g.:
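CREATE TABLE ClientStreet
(
ClientID INT NOT NULL -- client identifier (INT is an assumed type)
, StreetID INT NOT NULL -- references dim.Street (INT is an assumed type)
)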
Etc.
And if you cannot provide a Street without providing a County and State, or provide a County without providing a State, I would have within dim.Street, CountyID, and within dim.County, StateID, i.e. a hierarchical structure.
EDIT
With regards to your client dimensions with 3 IDs, this could also be a good model.
With regards to my hierarchical structure, and data modelling in general, I feel the way you model it really needs to:
Reflect reality (e.g., recording a client's Street as "[State of] California" seems inaccurate to me).
Reflect the reality of what is possible in terms of incoming data, i.e. do your clients input their address once, or can they change it (do you want to track changes?)?
One thing I'm wondering is whether your clients have to pick one and only one of these levels to record their address at. In this case, I'd have either the model I'd suggested above, or I'd have a client dimension with 3 IDs, and a CHECK CONSTRAINT to ensure that only one of these was ever populated. This would then be supported by the fact that a Street would have a CountyID, etc. So you would determine this kind of "channel" by which ID is populated in the client dimension.
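As a rough illustration of the "3 IDs plus CHECK constraint" idea, here is a sketch using Python's built-in sqlite3 module. The table name dim_client and the sample IDs are hypothetical. The constraint relies on SQLite evaluating each IS NOT NULL test to 1 or 0; other databases may need CASE expressions to count the populated columns instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_client (
        ClientID INTEGER PRIMARY KEY,
        StreetID INTEGER,
        CountyID INTEGER,
        StateID  INTEGER,
        -- Each test evaluates to 1 or 0 in SQLite, so the sum counts
        -- how many of the three IDs are populated.
        CHECK ((StreetID IS NOT NULL) + (CountyID IS NOT NULL)
               + (StateID IS NOT NULL) = 1)
    )
""")

# Valid: exactly one level populated per client.
conn.execute("INSERT INTO dim_client VALUES (1, 10, NULL, NULL)")
conn.execute("INSERT INTO dim_client VALUES (2, NULL, NULL, 3)")

# Invalid: two levels populated, so the constraint rejects the row.
try:
    conn.execute("INSERT INTO dim_client VALUES (3, 10, 20, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

count = conn.execute("SELECT COUNT(*) FROM dim_client").fetchone()[0]
print(rejected, count)
```

The level at which a client recorded their address can then be read off from which of the three columns is non-NULL.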

Aggregation - relational db mapping

I've seen an example of an aggregation between A and B, where B is the whole class, with multiplicities 0..* on the B end and 5 on A. Can it accurately be represented with relational tables? There should be an m:n AB table, but each value of B should appear exactly 5 times in it. Is it simply represented as an m:n table and when, let's say, selecting Bs, those that don't appear 5 times in the AB table are filtered out to get only valid data? (Valid from the user's point of view, not the DBMS's.) Still doesn't seem right. Are there other workarounds?
And what if the multiplicity on the B end is changed to 1..*, so each A must appear at least once in the AB table? How could the data be accurately represented in a tabular format?
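For reference, the query-time filtering described in the question could look roughly like this. It is only a sketch of that workaround using Python's built-in sqlite3 module, not a claim that it is the right way to enforce the multiplicities; the column names a_id and b_id and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (a_id INTEGER PRIMARY KEY);
    CREATE TABLE B (b_id INTEGER PRIMARY KEY);
    CREATE TABLE AB (
        a_id INTEGER REFERENCES A(a_id),
        b_id INTEGER REFERENCES B(b_id),
        PRIMARY KEY (a_id, b_id)
    );
""")
conn.executemany("INSERT INTO A VALUES (?)", [(i,) for i in range(1, 7)])
conn.executemany("INSERT INTO B VALUES (?)", [(1,), (2,)])
# B 1 is linked to five As (complete); B 2 only to three (incomplete).
conn.executemany(
    "INSERT INTO AB VALUES (?, ?)",
    [(a, 1) for a in range(1, 6)] + [(a, 2) for a in range(1, 4)],
)

# Bs that appear exactly 5 times in AB (the "valid" ones from the user's view).
valid_b = [b for (b,) in conn.execute(
    "SELECT b_id FROM AB GROUP BY b_id HAVING COUNT(*) = 5 ORDER BY b_id"
)]

# For the 1..* variant: As that do not appear in AB at all would be invalid.
orphan_a = [a for (a,) in conn.execute(
    "SELECT a_id FROM A WHERE a_id NOT IN (SELECT a_id FROM AB) ORDER BY a_id"
)]

print(valid_b)
print(orphan_a)
```

Both queries only detect rows that break the multiplicities after the fact; nothing in this schema prevents such rows from being stored.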
