I've seen an example of an aggregation between A and B, where B is the whole class, with multiplicities 0..* on the B end and 5 on the A end. Can it be represented accurately with relational tables? There should be an m:n AB table, but each value of B should appear exactly 5 times in it. Is it simply represented as an m:n table, so that when, say, selecting Bs, those that don't appear 5 times in the AB table are filtered out to get only valid data? (Valid from the user's point of view, not the DBMS's.) That still doesn't seem right. Are there other workarounds?
And what if the multiplicity on the B end is changed to 1..*, so each A must appear at least once in the AB table? How could the data be accurately represented in a tabular format?
Let's assume I have two simple tables:
IMPORTANT: This is about relational algebra, not SQL.
Band table:
| band_name | founded |
| --------- | ------- |
| Gambo     | 1975    |
| John.     | 1342    |
Album table:
| album_name | band_name |
| ---------- | --------- |
| Celsius.   | Gambo     |
| Trambo     | Gambo     |
Now, since the Band and the Album table share the same column name "band_name", would it be necessary to rename it when I join them?
As far as I know, the join eliminates the duplicate entry that is shared amongst the join. This example, where I simply pick all Bands that exist in the Album table (obviously just 'Gambo' in this given example)
Π founded, band_name (Band ⋈ Album)
should therefore work fine, right? Can somebody confirm?
(I have to enter a caveat: there are many variants of Relational Algebra, and they differ in both semantics and syntax. Assuming you intend a variant similar to the one on Wikipedia ...)
Yes that expression should work fine. The natural join operator ⋈ matches same-named attributes between its two operands. So the subexpression Band ⋈ Album produces a result with attributes {band_name, founded, album_name}. Your expression projects two of those.
Note the attributes for a relation value are a set not a sequence; therefore any operation over relation operands with same-named attributes must match attributes.
In contrast, Cartesian Product × requires its operands to have disjoint attribute names. So Band × Album is ill-formed and would be rejected. (You'd need to Rename band_name in one of them to get relations that could be operands.)
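To make the difference concrete, here is a minimal sketch in Python rather than in any particular Relational Algebra notation. Relations are modelled as lists of dicts, and the helper names (`natural_join`, `product`, `project`, `dedupe`) are made up for this illustration; they only approximate the Wikipedia-style semantics described above.

```python
# Relations modelled as lists of dicts (attribute name -> value).
band = [
    {"band_name": "Gambo", "founded": 1975},
    {"band_name": "John.", "founded": 1342},
]
album = [
    {"album_name": "Celsius.", "band_name": "Gambo"},
    {"album_name": "Trambo", "band_name": "Gambo"},
]

def attributes(rel):
    """Attribute names of a relation (empty set for an empty relation)."""
    return set(rel[0]) if rel else set()

def dedupe(rows):
    """Relations are sets: drop duplicate tuples, keeping first-seen order."""
    seen, out = set(), []
    for row in rows:
        key = frozenset(row.items())
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def natural_join(r, s):
    """Combine tuples that agree on every same-named attribute."""
    common = attributes(r) & attributes(s)
    return dedupe(
        {**t, **u} for t in r for u in s
        if all(t[a] == u[a] for a in common)
    )

def product(r, s):
    """Cartesian product: operands must have disjoint attribute names."""
    common = attributes(r) & attributes(s)
    if common:
        raise ValueError(f"shared attributes {sorted(common)}; rename first")
    return [{**t, **u} for t in r for u in s]

def project(rel, names):
    """Keep only the named attributes, then remove duplicate tuples."""
    return dedupe({a: t[a] for a in names} for t in rel)

joined = natural_join(band, album)
result = project(joined, ["founded", "band_name"])
print(result)
```

Each joined tuple carries a single band_name attribute, and calling `product(band, album)` raises an error until band_name is renamed in one operand.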
I'm not all that happy with your way of putting it "the join eliminates the duplicate entry that is shared amongst the join." Because only in SQL do you get a duplicate (from SELECT * FROM Band, Album ... -- which results in a table with four columns, of which two are named band_name). SQL FROM list of tables is a botch-up: neither join nor Cartesian Product, but something trying to be both, and succeeding only in being neither. RA's ⋈ never produces a "duplicate" so never does it "eliminate" anything.
Particularly if there are Keys declared and a Foreign Key constraint (from Album's band_name to Band's), I see the two band_name attributes as identifying the same band. The natural operation is then to bring together that which has been taken apart, hence the name 'Natural Join'.
I want to convert a given ER diagram with (min, max) notation to tables, and I'm unsure what the primary key of the "trainieren" relation is.
If the Relation R is between A and B and:
one to one -> the primary key is the primary key of either A or B
one to many -> the primary key is the primary key of the entity that takes part multiple times in the relation
many to many -> the primary key is the primary key of A and B
I'd interpret (0,1) and (1,1) as one, and (1,3) and (1,*) as many. My solution would therefore be (primary keys in strong text) trainieren: {[Trainer.AkkrNr, Teams.Land]}
Generally, we try to use only "one" or "many" as cardinality indicators, since those map easily to table structure. Most data modelers would do the same as you did in discarding the upper limit to simplify the model.
If you want to enforce that limit, there are a few ways to implement it:
Use your structure and a trigger on insert/update to count how many trainers the given team has, and throw an error if it exceeds 3.
You could add a position column to the primary key of trainieren and a constraint to limit it to values 1, 2 and 3. However, that imposes an ordering that wasn't part of the conceptual model.
Change trainieren to (Teams.Land PK, Trainer1.AkkrNr, Trainer2.AkkrNr, Trainer3.AkkrNr). Trainer2 and Trainer3 would need to be nullable, and this design loses the constraint that each trainer belongs to only one team. You could fix that with a trigger. Yuck.
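As a rough illustration of the first option, here is a sketch using Python's built-in sqlite3 module. It assumes AkkrNr alone is the primary key of trainieren (so each trainer belongs to one team) and covers inserts only; the trigger name and the helper `add_trainer` are made up for this example, and trigger syntax differs between DBMSs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trainieren (
    AkkrNr INTEGER PRIMARY KEY,   -- assumption: each trainer appears at most once
    Land   TEXT NOT NULL          -- the team this trainer coaches
);

-- Reject an insert when the team already has 3 trainers.
CREATE TRIGGER trainieren_max3
BEFORE INSERT ON trainieren
WHEN (SELECT COUNT(*) FROM trainieren WHERE Land = NEW.Land) >= 3
BEGIN
    SELECT RAISE(ABORT, 'team already has 3 trainers');
END;
""")

def add_trainer(akkr_nr, land):
    """Return True if the row was inserted, False if the database rejected it
    (either the trigger fired or the primary key was violated)."""
    try:
        with conn:
            conn.execute("INSERT INTO trainieren VALUES (?, ?)", (akkr_nr, land))
        return True
    except sqlite3.DatabaseError:
        return False

# The fourth trainer for the same team is refused.
results = [add_trainer(n, "DE") for n in (1, 2, 3, 4)]
print(results)
```

A complete version would need a similar trigger for updates that change Land.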
Since there's no ideal way to implement an upper bound on the relationship cardinality, most data modelers would follow the same approach as you did, and leave it to the database client (usually the application logic) to enforce that limit.
I have my Fact table with Policy data in it & I want to add Policy Products details to the warehouse.
One policy gets different types of products and the values also are dynamic.
E.g. Policy01 may have two products, Building & Contents, with sum insured values of 1000 & 500 respectively, while Policy02 has Building only, with 750.
There are like 30 products available and I need to store sum insured value, gross & net premiums of each product per policy.
So if I add a separate column for each product type to the fact table, it'll add like 120 more columns (currently there are 23 columns). Also, there are at most 5 products per policy, so only 20 columns will contain values and the others remain empty.
Is it ok to have 100+ columns in a fact table? Is it ok to keep this many empty values in a row?
Or is there any other approach I can use to solve this?
I'm a novice at DWH and hope someone can shed some light on how to add these to my fact table.
One approach is to add a product dimension:
You can then return totals by policy:
SELECT
PolicyKey,
SUM(PolicyProductValue) AS PolicyValue
FROM
Fact.PolicyProductValue
GROUP BY
PolicyKey
;
Or product:
SELECT
ProductKey,
SUM(PolicyProductValue) AS ProductValue
FROM
Fact.PolicyProductValue
GROUP BY
ProductKey
;
Or both:
SELECT
PolicyKey,
ProductKey,
SUM(PolicyProductValue) AS PolicyProductValue
FROM
Fact.PolicyProductValue
GROUP BY
PolicyKey,
ProductKey
;
This approach moves the products from the columns to the rows.
This technique offers several benefits:
It is easier to add new rows than columns.
You can add common filters to Dim.Product.
Dim.Product provides a location to create product hierarchies. Example:
| Product Key | Product Name | Product Group |
| ----------- | ------------ | --------------------|
| 0 | Building | Building & Contents |
| 1 | Contents | Building & Contents |
It's not ok to have 100+ columns in a fact table; it's a symptom of an incorrect data model (the same is true for missing values - a well designed fact table shouldn't have any).
The logic of the fact table design is the following:
First, decide on the table "granularity" - the most atomic level of data it will contain. In your case, data granularity is defined by Policy number + Product. Together they uniquely identify the most detailed information available to you.
Then, identify your "facts". Typically, facts are pieces of data that you can aggregate (sum, count, average, etc). In your case, they are Insured_Value, Gross_Premium, Net_Premium.
Finally, define business context for these facts (dimensions). In your case, they are Policy and Product (most likely, you will also have some kind of Date).
Your resulting fact table should look something like this:
Policy_Date
Policy_Number
Product_ID
Insured_Value
Gross_Premium
Net_Premium
Policy_Date will provide connection to "Calendar" dimension, Product_ID will connect to "Product" dimension (table that contains your 30 products and their descriptions).
Policy_Number is what's called a "Degenerate Dimension" - it's an ID that is usually not connected to any dimension table (but could be if you need it to). It's stored in the fact table just as a reference. Some people add a "Policy" dimension to the model, but usually it's a design mistake - such dimensions are too "tall", comparable in size to the fact table, which can dramatically slow down your model performance. It's usually better to split policy attributes into multiple small dimensions and leave the policy number as a degenerate dimension.
So, your typical policy with 5 products will be represented as 5 records in the fact table, rather than one record with 5 fields. This is the critical difference - never, ever store information (products in your case) in the name of the fact table fields.
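The layout above can be sketched with Python's built-in sqlite3 module. The table names (`fact_policy_product`, `dim_product`) are made up for this illustration, the insured values come from the Policy01/Policy02 example in the question, and the dates and premiums are placeholder values only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    Product_ID   INTEGER PRIMARY KEY,
    Product_Name TEXT NOT NULL
);
CREATE TABLE fact_policy_product (
    Policy_Date   TEXT NOT NULL,
    Policy_Number TEXT NOT NULL,      -- degenerate dimension
    Product_ID    INTEGER NOT NULL REFERENCES dim_product (Product_ID),
    Insured_Value REAL NOT NULL,
    Gross_Premium REAL NOT NULL,
    Net_Premium   REAL NOT NULL,
    PRIMARY KEY (Policy_Number, Product_ID)   -- the grain: policy + product
);
""")

conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?)",
    [(0, "Building"), (1, "Contents")],
)
# One row per policy and product; dates and premiums are placeholders.
conn.executemany(
    "INSERT INTO fact_policy_product VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2020-01-01", "Policy01", 0, 1000, 12.0, 10.0),
        ("2020-01-01", "Policy01", 1, 500, 6.0, 5.0),
        ("2020-01-01", "Policy02", 0, 750, 9.0, 7.5),
    ],
)

# Total sum insured per policy, aggregated over product rows.
totals = conn.execute(
    "SELECT Policy_Number, SUM(Insured_Value) FROM fact_policy_product "
    "GROUP BY Policy_Number ORDER BY Policy_Number"
).fetchall()
print(totals)
```

Adding a 31st product then means one new row in the product table, with no change to the fact table's columns.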
If you have a relational database and you want to start making reports, you might do the following (please let me know if this is incorrect).
Go through your relational database and make a list of all the columns that you want to include in your report.
Group related columns together and then split those (normalise) into additional tables. These are the dimensions.
The dimensions then have a primary key (possibly a combination of two rows), and the fact table has a foreign key to reference each dimension, plus fields that you don't separate out in the first place such as sales value.
The question:
I was originally seeing dimensions as data marts that referenced data from external sources, and a fact table that in turn referenced data in the dimensions... that's incorrect, isn't it? It's the other way around...
Or in general, if you were to normalise a database you would always replace the columns you take out a table with a foreign key, and add a primary key to the new table?
A fact table represents a process or event that you want to analyze.
Step 1: What is the process or event that you want to analyze?
The columns in the fact table represent all of the variables that are pertinent to your analysis.
Step 2: What variables are pertinent to the analysis?
Whether you "split-out" columns into dimension tables is irrelevant to your understanding. It's an optimization to minimize the space taken up by fact tables.
If you want to discriminate between measures and dimensions, ask
Step 3: What are the (true) numeric values in my fact table? These are your measures.
An example of a true numeric value is a dollar amount, like Sales Order Line Item Extended Price. You can sum it up or take an average of it.
An example of a not true numeric value is Customer ID 12345. It's a number, but represents something that isn't a number (a customer). The sum of customer ids makes no sense, nor does the average. Dig?
Regarding your questions:
Fact tables do not need foreign keys to dimension tables. (hint: see Hot-Swappable Dimensions)
"dimensions as data marts that referenced data from external sources". Hm...maybe, but don't worry about data marts for now. A dimension is just a column in your fact table (that isn't a measure). A dimension table is just a collection of dimensions that are related.
Just start with Excel. Figure out the columns you need in your analysis. Put them in Excel. That's your fact table. If you expect your fact table to get large (100s of MB), then do ONE level of normalization:
Figure out your measures. Leave them in the fact table.
Figure out your dimensions. Group them together (Customer info into one group, Store info into another).
Put them in their own tables. Give them meaningless surrogate keys. Put those keys in the fact table.
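Those three steps might look roughly like this in plain Python. The column names and sample rows are hypothetical, and a real warehouse would do this in an ETL tool or in SQL rather than in a loop.

```python
# A flat "Excel-style" fact table: every column the analysis needs, one row per sale.
flat_rows = [  # hypothetical sample data
    {"customer_name": "Ann", "customer_city": "Oslo", "sales_value": 10.0},
    {"customer_name": "Bob", "customer_city": "Rome", "sales_value": 4.5},
    {"customer_name": "Ann", "customer_city": "Oslo", "sales_value": 2.5},
]

# Dimensions that belong together form one group; sales_value is the measure.
CUSTOMER_COLS = ("customer_name", "customer_city")

customer_keys = {}   # (name, city) -> meaningless surrogate key
dim_customer = []    # the dimension table
fact_sales = []      # the slimmed-down fact table

for row in flat_rows:
    attrs = tuple(row[c] for c in CUSTOMER_COLS)
    if attrs not in customer_keys:
        key = len(customer_keys) + 1          # next surrogate key
        customer_keys[attrs] = key
        dim_customer.append({"customer_key": key, **dict(zip(CUSTOMER_COLS, attrs))})
    # The fact row keeps the measure and only the key of the dimension row.
    fact_sales.append(
        {"customer_key": customer_keys[attrs], "sales_value": row["sales_value"]}
    )

print(dim_customer)
print(fact_sales)
```

The fact table keeps one row per original row, while each distinct customer is stored once in the dimension table.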
I have a nooby question regarding 1NF.
As I read from different sources, a table is in 1NF if it contains no repeating groups.
I understand this with the examples given online (usually with customers and contact names etc) but when it comes to my specific data I face difficulties.
I have the following fields:
ID TOW RECEIVER Phi01_L1 Phi01_L2 Phi01_L3
1 4353 gpo1 0.007 0.006 0.4
2 4353 gpo1 0.9 0.34 0.3
So, this table here is not in 1NF? How should it be changed in order to be in 1NF?
What is First normal form (1NF)?
1NF disallows:
composite attributes
multivalued attributes
and nested relations; attributes whose values for an individual tuple are non-atomic
How to convert a relation into 1NF?
Two ways to convert into 1NF:
Expand relation:
Increase the number of columns in the relation (as you did)
Increase the number of rows and change the Primary Key. (The PK will include the non-atomic attribute.)
Hence your relation is in 1NF in its present state, and the solution you made is expansion.
Break Relation:
Break the relation into two relations, e.g. remove the non-atomic column from the base relation and create a new relation that holds it together with the PK.
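Both conversions can be sketched in a few lines of Python. The customer/phone data below is hypothetical (it is not the table from the question) and is only meant to show a multivalued attribute being removed.

```python
# Not in 1NF: 'phones' holds several values per tuple (hypothetical data).
unnormalised = [
    {"customer_id": 1, "name": "Ann", "phones": ["555-0100", "555-0101"]},
    {"customer_id": 2, "name": "Bob", "phones": ["555-0199"]},
]

# Expand by rows: one row per phone; the key becomes (customer_id, phone).
expanded = [
    {"customer_id": r["customer_id"], "name": r["name"], "phone": p}
    for r in unnormalised
    for p in r["phones"]
]

# Break the relation: the base relation loses the multivalued attribute ...
customer = [
    {"customer_id": r["customer_id"], "name": r["name"]} for r in unnormalised
]
# ... and a new relation holds it together with the base relation's PK.
customer_phone = [
    {"customer_id": r["customer_id"], "phone": p}
    for r in unnormalised
    for p in r["phones"]
]

print(expanded)
print(customer)
print(customer_phone)
```

Expanding by rows repeats the name for every phone, whereas breaking the relation stores each name once.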
Normal forms are best explained in the Elmasri/Navathe book.