integrating my oracle database into my owl with ontop - mapping

I struggle integrating my oracle database into my owl in terms of keys.
So If I have 2 tables
Customer
Name
Adress
IDNumber
First
row
1234
Second
row
5678
Billing
BillingNo
Date
IDNumber
987
row
1234
986
row
1234
654
row
5678
and I want to map the tables with my ontology, do I have to set up my IDNumbers as the same Data property with two Domains (Customer, Billing)? Or would it be a better solution to set up two IDNumbers with a different prefix (cust_IDNumber, bill_IDNumber) and combine them with "equivalent to"?
Is there a difference between both methods (performance, validity etc.)?
Could anyone please suggest a good book or a good tutorial?

Related

Relations between slowly changing dimensions in a data warehouse

I’m designing a data warehouse and am struggling to work out how I should be modelling this scenario. I’ve got Users stored in a Slowly Changing Dimension Type 2 table along these lines:
UserKey
UserID
Language
EffectiveDate
ExpiryDate
IsCurrent
1
1001
EN
2021-01-01
9999-12-3
Y
2
1002
EN
2021-07-31
2022-01-06
N
3
1002
FR
2022-01-06
9999-12-31
Y
And a Login fact table like:
LoginKey
UserKey
LoginTime
12345
2
2021-12-25 15:00
12399
3
2022-01-31 18:00
Thereby allowing us to report on logins by date by user language setting at the time, etc.
Now I have to consider that each user may have one, none, or many concurrent subscriptions, which I was thinking of modelling in a Type 1 SCD thus:
SubsKey
SubsID
SubsType
UserKey
StartDate
EndDate
55501
SBP501
Premium
2
2021-08-01
2022-08-01
55502
SBB123
Bonus
3
2022-01-31
2023-01-31
Is it right for one dimension table to reference the surrogate row key of another like this, or should it rather contain the UserID natural key? It seems unwieldy for the Subs table to have different UserKeys for the two concurrent Subscriptions for the same user like this. Or perhaps, when the third row was added to the Type 2 User table, should all the existing rows in Subs with UserKey=2 have been updated to UserKey=3?
The whole thing doesn't seem to fit comfortably into the classic snowflake pattern, which usually has the one-to-many relationship pointing the other way, as might be the case were Language to be a separate dimension table say, with a one-to-many relation on User.
Edit
I'm wrestling with not only in the one-to-many example described (one user has many subscriptions) but also many-to-one relations between SCDT2 tables e.g. If the user's language was stored in a SCDT2 table, should the User dimension use reference the Language ID or the LanguageKey for Language table's current row?
A subscription is a fact and so should be stored in a fact table - though you might also have a subscription dimension that holds attributes of a subscription such as its name.
You relate dimensions through fact tables, so your subscription fact would have FKs to Subscription, User, Date etc dimensions.
Relating dimensions directly to each other is called snowflaking and is, generally a bad design.
BTW for an SCD2 table, having the expiry date of one row the same as the effective date of the next row is not a good design. In your example, you would need business logic to define which row was active on 2022-01-06, whereas if a row expires on 2022-01-06 and the next row starts on 2022-01-07 there can be no confusion.
Based on your examples, the last table looks like more close to SLCD Type 4 than Type 1.
Indeed, I agree that subscriptions might be a Fact table and have a Dimension table.
Perhaps, an SLCD Type 2 can be the best option for the subscriptions dimension table but adding a flag column to set the current/active subscription with his associated effective date.

Preferred no of columns for a Fact table?

I have my Fact table with Policy data in it & I want to add Policy Products details to the warehouse.
One policy gets different types of products and the values also are dynamic.
Eg: Policy01 may have two products Building & Contents where sum insured values are 1000 & 500 respectively. And Policy02 get Building only of 750.
There are like 30 products available and I need to store sum insured value, gross & net premiums of each product per policy.
So if I add separate column for each product type into fact table it'll add live 120 more columns (currently there are 23 columns). Also max 5 products per policy so only 20 columns will contain values & others remain empty.
Is it ok to have 100+ columns for fact table? Is it ok to keep this many empty values in a row?
Or is there any other approach I can solve this?
I'm a novice at DWH and hope someone can shed me some light how to add these to my fact table.
One approach is to add a product dimension:
You can then return totals by policy:
SELECT
PolicyKey
SUM(PolicyProductValue) AS PolicyValue
FROM
Fact.PolicyProductValue
GROUP BY
PolicyKey
;
Or product:
SELECT
ProductKey,
SUM(PolicyProductValue) AS ProductValue
FROM
Fact.PolicyProductValue
GROUP BY
ProductKey
;
Or both:
SELECT
PolicyKey,
ProductKey,
SUM(PolicyProductValue) AS PolicyProductValue
FROM
Fact.PolicyProductValue
GROUP BY
PolicyKey,
ProductKey
;
This approach moves the products from the columns to the rows.
This technique offers several benefits:
It is easier to add new rows than columns.
You can add common filters to Dim.Product.
Dim.Product provides a location to create product hierarchies. Example:
| Product Key | Product Name | Product Group |
| ----------- | ------------ | --------------------|
| 0 | Building | Building & Contents |
| 1 | Contents | Building & Contents |
It's not ok to have 100+ columns in a fact table; it's a symptom of an incorrect data model (the same is true for missing values - a well designed fact table shouldn't have any).
The logic of the fact table design is the following:
First, deside on the table "granularity" - the most atomic level of data it will contain. In your case, data granularity is defined by Policy number + Product. Together they uniquely identify the most detailed information available to you.
Then, identify your "facts". Typically, facts are pieces of data that you can aggregate (sum, count, average, etc). In your case, they are Insured_Value, Gross_Premium, Net_Premium.
Finally, define business context for these facts (dimensions). In your case, they are Policy and Product (most likely, you will also have some kind of Date).
Your resulting fact table should look something like this:
Policy_Date
Policy_Number
Product_ID
Insured_Value
Gross_Premium
Net_Premium
Policy_Date will provide connection to "Calendar" dimension, Product_ID will connect to "Product" dimension (table that contains your 30 products and their descriptions).
Policy_Number is what's called a "Degenerate Dimension" - it's an ID that is usually not connected to any dimensions (but could if you need to). It's stored in a fact table just as a reference. Some people add "Policy" dimension to the model, but usually it's a design mistake - such dimensions are too "tall", comparable in size to the fact table, which can dramatically slow down your model performance. It's usually better to split policy attributes into multiple small dimesions and leave the policy number as a degenerate dimension.
So, your typical policy with 5 products will be represented as 5 records in the fact table, rather than one record with 5 fields. This is the critical difference - never, ever store information (products in your case) in the name of the fact table fields.

Entity Relationship Diagram: How to create a Yelp-kind of app with not just one price-range?

Im new to Rails and I'm in the middle of sketching up an ERD for my new app. A Yelp-sort of app, where a Client is sorted by price.
So I want one Client to have many priceranges - One Client can both have pricerange $ and Pricerange $$$$ for example. The priceranges are:
$ - $$ - $$$ - $$$$ - $$$$$
How would this look in a table? Would I create a table called PriceRange with Range1, Range2, Range3, Range4, Range5 to be booleans?
Doesn't the PriceRange-table need any foreign/primary keys?
PriceRange
Range1 (Boolean)
Range2 (Boolean)
Range3 (Boolean)
Range4 (Boolean)
Range5 (Boolean)
Look, I'm Brazilian and I'm not very knowledgeable about yelp applications. I do not quite know what it is, but from what I saw, they are systems to assess/measure/evaluate (perhaps the translation is wrong here for you) things, in this case, companies, right?
Following this logic, let's think...
By the description of your problem (context), you have clients (companies), and they can have price ranges, correct? If:
A price interval is represented by textual names, such as "$", "$$",
and so on,
and the same price range may have (numeric) values for different companies,
And the same price range (type) can be (or not) assigned to different
companies,
Then here is what we have:
By decomposing this conceptual model, you would end up with three tables:
Companies
Price Ranges
Price Ranges from Companies
The primary keys of Company and Price Ranges will be passed to Price Ranges from Companies as foreign keys. You can use them as a composite primary key, or use a surrogate key. If using a surrogate key, you will permit/allow a company to have the same kind of price range more than once, which I believe is not the case.
Let's look at another situation, if things are simpler as:
If there is no need to store prices,
and an company may have or not one or more price ranges represented by "$", "$$", and so on,
Then here is what we have:
Similarly, we'll have the same 3 tables. Likewise, you still must pass the primary keys of Companies and Price Ranges to Price Ranges from Companies as foreign keys.
So I want one Client to have many priceranges - One Client can both
have pricerange $ and Pricerange $$$$ for example
Notice how N-N relationships allow us to create optional relationships between entities. This will allow a company to have zero, one, two, (etc.) or all price ranges defined. Again, so that is not allowed a company to have a price range more than once, set the foreign keys as composite primary key in Price Ranges from Companies.
If you have any questions or anything I explained has nothing to do with your context, please do not hesitate to comment.
EDIT
Is the Price ranges from companies what is called a Joint table?
Yes. There are also other terms used, some in different areas of computer science, such as Link Table, or Intermediate Table.
Actually we do not have a table here in the diagram, but an entity. In the Conceptual Model there are no tables, but entities and relationships. Be careful with this terminology when developing the Conceptual Model, or else you may get confused (I say this from experience).
However, yes, once decomposed, we will have a table from this relationship. When decomposed, N-N relationships will always become tables, no exception. Differently, 1-1 and 1-N (or N-1) relationships do not become tables. These tables with these special names (Join/Link/Intermediate Tables) serves to associate records from different tables, hence the name.
And is it necessary to have a column called Price Range Id? I mean
what is it there for?
At where? If you say at the Price Ranges entity, it is rather necessary. Must We not identify records in a table in some way? Here I set what is called a Surrogate Key. If on the other hand, you have a column with unique values for each record in the table, you can also use this column. I highly recommend that you consider the use of surrogate keys. Read the link I gave you.
In the Conceptual Model, we have to define the properties and also the primary keys. During the phase of the conceptual model, natural attributes of entities can become primary keys if you so desire. In this case, we have what is called a Natural Key.
If on the other hand you refer to Price Ranges from Companies entity, so the question is another ("And is it necessary to have a column called Price Range Id?"). Here we have a table with two columns, as I told you. The two are foreign keys. You need it so you can relate rows from the two tables... I think you were not referring to that, is not it? If so, no problem, you can comment and ask more questions. I do not care to answer. To be honest, I did not quite understand your question.
EDIT 2
So that Company 28 can be identified in the Price Ranges (for instance
ID 40) Which would make it easier to call out the price ranges it has?
Maybe my English is not very good, but it seems to me that you have a beginner's doubt/question in relation to the concept of tables and relationships between them. If not that, I apologize because maybe I did not understand. But let's see...
The tables in a database have rows / records. Each line has its own data. Even with this, each line / record needs to be differentiated and identified somehow. That is why we attach to each line an identifier, known as the primary key (this, and this). In summary, the primary key is how we identify, differentiate, separate and organize different records.
Even if all records have different values, you must select a field (column) that represents the primary key of the table. By obligation, every record MUST have a primary key. Although you can choose which field is a primary key, you are allowed to choose one or more fields to serve as the primary key. When this happens, that is, when more than one field participates/serves as the primary key, we have a table with something called Composite Primary Key. Similarly, it has the ability to identify records. Note that, because of that, primary key values must be unique, otherwise you may have 2 identical records.
This is the basic concept so that we can relate tables to each other, in case, records/rows of tables together. If we have a Company identified by the ID 28 (a line/record), and we want to relate it to a Price Range identified by the ID 40, then we need to store somewhere that relationship (28 <--> 40). This is where the role of intermediate/link/join tables comes in (but only to relationships N-N! For 1-N or N-1 relationships it works similarly, but not identical).
My original question was whether it was necessary, and why a company
ID had to link up with a price range ID at all.
With this table storing records which relates to other records (for their primary keys), we can perform a SQL join operation (If you have questions about this, see this image). Depending on how you perform this operation, you'll get:
All companies that have Price Ranges.
All companies that do not have Price Ranges.
All the Price Ranges of a given company.
All companies that have or not a X Price Range.
All price ranges that are given or not to companies.
...
Anyway, you get all this because of the established relationship.
If it could just be taken out and then the table of price ranges would
only involve Pricerange1-5.
This sentence I did not understand. What should be taken out? Could you please explain this sentence better?

Should I flatten multiple customer into one row of dimension or using a bridge table

I'm new to datawarehousing and I have a star schema with a contract fact table. It holds basic contract information like Start date, end date, amount ...etc.
I have to link theses facts to a customer dimension. there's a maximum of 4 customers per contract. So I think that I have two options either I flatten the 4 customers into one row for ex:
DimCutomers
name1, lastName1, birthDate1, ... , name4, lastName4, birthDate4
the other option from what I've heard is to create a bridge table between the facts and the customer dimension. Thus complexifying the model.
What do you think I should do ? What are the advantages / drawbacks of each solution and is there a better solution ?
I would start by creating a customer dimension with all customers in it, and with only one customer per row. A customer dimension can be a useful tool by itself for CRM and other purposes and it means you'll have a single, reliable list of customers, which makes whatever design you then implement much easier.
After that it depends on the relationship between the customer(s) and the contract. The main scenarios I can think of are that a) one contract has 4 customer 'roles', b) one contract has 1-4 customers, all with the same role, and c) one contract has 1-n customers, all with the same role.
Scenario A would be that each contract has 4 customer roles, e.g. one customer who requested the contract, a second who signs it, a third who witnesses it and a fourth who pays for it. In that case your fact table will have one row per contract and 4 customer ID columns, each of which references the customer dimension:
...
RequesterCustomerID int,
SignatoryCustomerID int,
WitnessCustomerID int,
BillableCustomerID int,
...
Of course, if one customer is both a requester and a witness then you'll have the same ID in both RequesterCustomerID and WitnessCustomerID because you only have one row for him in your customer dimension. This is completely normal.
Scenario B is that all customers have the same role, e.g. each contract has 1-4 signatories. If the number of signatories can never be more than 4, and if you're very confident that this will 'always' be true, then the simple solution is also to have one row per contract in the fact table with 4 columns that reference the customer dimension:
...
SignatoryCustomer1 int,
SignatoryCustomer2 int,
SignatoryCustomer3 int,
SignatoryCustomer4 int,
...
Even if most contracts only have 1 or 2 signatories, it's not doing much harm to have 2 less frequently used columns in the table.
Scenario C is where one contract has 1-n customers, where n is a number that varies widely and can even be very large (class action lawsuit?). If you have 50 customers on one contract, then adding 50 columns to the fact table becomes difficult to manage. In this case I would add a bridge table called ContractCustomers or whatever that links the fact table with the customer dimension. This isn't as 'neat' as the other solutions, but a pure star schema isn't very good at handling n:m relationships like this anyway.
There may also be more complex cases, where you mix scenarios A and C: a contract has 3 requesters, 5 signatories, 2 witnesses and the bill is split 3 ways between the requesters. In this case you will have no choice but to create some kind of bridge table that contains the specific customer mix for each contract, because it simply can't be represented cleanly with just one fact and one dimension table.
Either way can work but each solution has different implications. Certainly you need customer and contract tables. A key question is: is it always a maximum of four or may it eventually increase beyond that? If it will stay at 4, then you can have a repeating group of 4 customer IDs in the contract. The disadvantage of this is that it is fixed. If a contract does not have four, there are some empty spaces. If, however, there might be more than 4, then the only viable solution is to use a bridge table because in a bridge table you add more customers by inserting new rows (with no change to the table structure). In the fixed solution, in this case you add more than 4 customers by altering the table. A bridge table is an example of what, for many decades now, ER modeling has called an associative entity. It is the more flexible of the two solutions. However, I worked on a margin system once wherein large margin amounts needed five levels of manager approval. It has been five and will always be five, they told me. Each approving manager represented a different organizational level. In this case, we used a repeating group of five manager IDs, one for each level, and included them in the trade. So it is important to understand the current business rules and the future outlook.

Zip code based search

I have an app where shop owners can enter 10 zip codes in which they can provide services. Currently these zip codes are stored in single table column. Now what is the best and efficient way to do search based on this? Should I store all zip codes(all US zip codes) in a table and establish many to many relationship or do text search based on the current field using thinking sphinx?
A database guy's perspective . . .
Since you're talking about using Sphinx, I presume you store all 10 ZIP codes in a single row, like this.
shop_id zip_codes
--
167 22301, 22302, 22303, 22304, 22305, 22306, 22307, 22308, 22309, 22310
You'd be far better off storing them like this, for search and for several other reasons.
shop_id zip_codes
--
167 22301
167 22302
167 22303
167 22304
167 22305
167 22306
167 22307
167 22308
167 22309
167 22310
-- Example in SQL.
create table serviced_areas (
shop_id integer not null references shops (shop_id), -- Table "shops" not shown.
zip_code char(5) not null,
primary key (shop_id, zip_code)
);
You can make a good case for stopping after making this single change.
But you can increase data integrity substantially without making any other changes to your database if your dbms supports regular expressions. With that kind of dbms support, you can guarantee that the zip_code column contains only 5 integers, no letters. (There may be other ways to guarantee 5 integers and no letters.)
A table of ZIP codes would further increase data integrity. But you could easily argue that the shop owners have a vested interest in entering valid ZIP codes in the first place, and that this isn't worth more effort on your part. ZIP codes change pretty often; don't expect a "complete" table of ZIP codes to be accurate for very long. And you need to have a well-defined procedure for dealing with both new and expired ZIP codes.
-- Example in SQL
create table zip_codes (
zip_code char(5) primary key
);
create table serviced_areas (
shop_id integer not null references shops (shop_id),
zip_code char(5) not null references zip_codes (zip_code),
primary key (shop_id, zip_code)
);
You will need the zipcodes plus latitude/longitude in your db if you're using sphinx to do geospatial search (not really, you can use a text file or xml I suppose).
By geospatial search I mean something like "Find stores within 20 miles of your location"
For flexibility and efficiency, I would pick #1 ....
"store all zip codes in a table and establish many to many
relationship"
...with the assumption you also need to store other zip code data fields (City, State, county, Lat/Long, etc.). In that case your intersection would be: shop_id to zipcode_id(s). However, if you do not need/have extended ZIP Code data fields, then a single separate table with shop_id to the acutal zipcodes (not id) will be fine in my opinion.

Resources