Problems with Column in Fact Table - data-warehouse

I'm building a DW just like the one from AdventureWorks. I have one fact table called FactSales, and there's a table in the database called SalesReason that tells us why a certain customer buys our product.
The thing is, there are two types of customers, resellers and online customers, and only the online customers have a sales reason linked to them.
First of all, can I have two dimension tables pointing to the same FK in the fact table? In my case, Sk_OnlineCustomer and SK_Resseler both point to FK_Customer. Their ID numbers don't overlap.
And second:
Should I build a reason dimension, link it to the fact and have an FK that is null most of the time, or points to a "dummy reason"?
Should I just put the reason in the sales fact without it being a key, like a nullable descriptive column?
Should I divide the fact into two fact tables, one for the resellers and one for the online customers? Even in that case, some customers would not give a reason, so fk_reason would be null in some rows of the new fact_Online_Customer.
In a solution I saw in the AdventureWorks tutorial, a new fact table called fact_reason is created. It links factSales with a DimReason.
That looks like a good solution, but I don't know how it works, because I never learned in my classes that I could link a fact to a fact, so I wouldn't be able to justify my choice to my teacher.
If you could explain it I would appreciate it.
Thanks!

Please find my comments for your questions:
First of all, can I have two dimension tables pointing to the same FK in the fact table? In my case, Sk_OnlineCustomer and SK_Resseler both point to FK_Customer. Their ID numbers don't overlap.
Yes. The dimension in this case would be Dim_Customer (for example), and it could be a role-playing dimension. You can expose reporting views to separate online customers from reseller customers.
And second, should I build a reason dimension, link it to the fact and have an FK that is null most of the time, or points to a "dummy reason"?
Yes, it would make sense to build a reason dimension. With it you can tag each fact record with its reason.
Should I divide the fact into two fact tables, one for the resellers and one for the online customers? Even in that case, some customers would not give a reason, so fk_reason would be null in some rows of the new fact_Online_Customer.
I would suggest you keep one fact table, since your business activity is sales; you can add context to it (online or reseller) using your dimensions. If you prefer, you can have a separate Dim_Sales dimension to hold the sales type and other details of the sale that you cannot include in the fact.
To summarise, you would probably be well served by the following model:
Fact_Sales linked to
Dim_Customer
Dim_Sales
Dim_Reason (this could alternatively be folded into Dim_Sales)
Dim_Date(always include a date dimension when you build a DWH solution)
Hope that helps...
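To make the suggestion above concrete, here is a minimal sketch of the single-fact design using Python's built-in sqlite3 module. The column names (CustomerType, ReasonName, SalesAmount), the sample rows and the view names are assumptions made for illustration, not part of AdventureWorks, and Dim_Date and Dim_Sales are left out to keep it short. It shows a "Not applicable" row in Dim_Reason standing in for a NULL key, plus reporting views that separate the two customer types.

```python
import sqlite3


def build_sales_star():
    """Create a tiny in-memory star schema. All names and rows are illustrative."""
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
    con.executescript("""
        CREATE TABLE Dim_Customer (
            SK_Customer  INTEGER PRIMARY KEY,
            CustomerName TEXT NOT NULL,
            CustomerType TEXT NOT NULL
                CHECK (CustomerType IN ('Online', 'Reseller'))
        );
        CREATE TABLE Dim_Reason (
            SK_Reason  INTEGER PRIMARY KEY,
            ReasonName TEXT NOT NULL
        );
        CREATE TABLE Fact_Sales (
            FK_Customer INTEGER NOT NULL REFERENCES Dim_Customer (SK_Customer),
            FK_Reason   INTEGER NOT NULL REFERENCES Dim_Reason (SK_Reason),
            SalesAmount REAL NOT NULL
        );
        -- Reporting views that separate the two customer types.
        CREATE VIEW v_OnlineSales AS
            SELECT f.* FROM Fact_Sales f
            JOIN Dim_Customer c ON c.SK_Customer = f.FK_Customer
            WHERE c.CustomerType = 'Online';
        CREATE VIEW v_ResellerSales AS
            SELECT f.* FROM Fact_Sales f
            JOIN Dim_Customer c ON c.SK_Customer = f.FK_Customer
            WHERE c.CustomerType = 'Reseller';
        -- Key 0 is the dummy member used when no reason was given.
        INSERT INTO Dim_Reason VALUES
            (0, 'Not applicable'), (1, 'Price'), (2, 'Promotion');
        INSERT INTO Dim_Customer VALUES
            (1, 'Alice', 'Online'), (2, 'Bike Shop', 'Reseller');
        INSERT INTO Fact_Sales VALUES
            (1, 1, 100.0),  -- online sale with a reason
            (1, 0, 40.0),   -- online sale, customer gave no reason
            (2, 0, 500.0);  -- reseller sale, reason never applies
    """)
    return con


def sales_by_reason(con):
    """Total sales per reason; the inner join works because FK_Reason is never NULL."""
    return con.execute("""
        SELECT r.ReasonName, SUM(f.SalesAmount)
        FROM Fact_Sales f
        JOIN Dim_Reason r ON r.SK_Reason = f.FK_Reason
        GROUP BY r.ReasonName
        ORDER BY r.ReasonName
    """).fetchall()
```

With a dummy member instead of NULL, every fact row joins to exactly one reason row, so totals by reason still add up to total sales.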


HR Data Mart Design Advice

I am working on a design for an HR data mart using the Kimball approach outlined in 'The Data Warehouse Toolkit'.
As per the Kimball design, I was planning to have a time-stamped, slowly-changing dimension to track employee profile changes (to support point-in-time analysis of employee state) and a head-count periodic snapshot fact table to support measures of new hires, leavers, leave taken, salary paid etc.
The problem I've encountered is that, in some cases, our employees can be assigned to multiple roles/jobs and each one needs to be tracked separately (i.e. the grain of my facts has to be at job-level, not employee level).
How might the Kimball design be adapted to fit a scenario where employee and role/job form a hierarchy like this? Ideally, I want to avoid duplicating employee profile data (address, demographics etc) for each role/job an employee is assigned to, but does this mean I need to snow-flake the dimension?
Options I've been considering include the below - I'd be interested in any thoughts or suggestions the community has on this so all input is welcome!
1) (see attached, design 1) A snowflake-style approach with an employee table which has a 1-to-many link to a role table, which, in turn, has a 1-to-many link with the fact table. The advantage here is a clean employee dimension, but I don't want to introduce unnecessary complexity. Is there any reason why I shouldn't link both dimensions directly to the fact table? The snowflake designs I've seen don't seem to do this.
2) (see attached, design 2) A combined Employee/Role dimension where each employee has a record for each assigned role but only one of them is flagged as 'Primary Role'. Point-in-time queries on the dimension can be performed by constraining on the 'Primary Role' flag.
Anything that occurred is an event and can be a fact. When you look at relationships between data, you also need to ask whether the data value describes the entity (dim) or something that happened to or with the entity (fact). Everything can be a dim or a fact (sometimes both).
A job describes an event that happened to the employee. You should have a fact employeejob that relates to the Dim employee and Dim job (as well as your date dimensions). This will then allow you to break down absences by employee and job. Your dim job would really just be job title, pay grades, etc. The fact would contain effective dates. Research factless fact tables.
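As a rough sketch of the factless fact table idea, assuming made-up table and column names (dim_employee, dim_job, fact_employee_job, effective_from, effective_to) and ISO date strings, the assignment of an employee to a job can be stored as one row per assignment with no numeric measures. The example uses Python's sqlite3 module and leaves out the date dimension for brevity.

```python
import sqlite3


def build_hr_mart():
    """Create an in-memory HR mart. Names and sample rows are illustrative."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_employee (
            employee_key INTEGER PRIMARY KEY,
            full_name    TEXT NOT NULL
        );
        CREATE TABLE dim_job (
            job_key   INTEGER PRIMARY KEY,
            job_title TEXT NOT NULL,
            pay_grade TEXT NOT NULL
        );
        -- Factless fact: one row per employee/job assignment, no measures.
        CREATE TABLE fact_employee_job (
            employee_key   INTEGER NOT NULL REFERENCES dim_employee (employee_key),
            job_key        INTEGER NOT NULL REFERENCES dim_job (job_key),
            effective_from TEXT NOT NULL,  -- ISO date, inclusive
            effective_to   TEXT            -- ISO date, inclusive; NULL = still active
        );
        INSERT INTO dim_employee VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO dim_job VALUES
            (10, 'Lecturer', 'G7'), (11, 'Lab Manager', 'G6');
        INSERT INTO fact_employee_job VALUES
            (1, 10, '2020-01-01', NULL),
            (1, 11, '2021-06-01', '2022-05-31'),
            (2, 11, '2022-06-01', NULL);
    """)
    return con


def assignments_on(con, as_of):
    """Return (employee, job) pairs that were active on the given ISO date."""
    return con.execute("""
        SELECT e.full_name, j.job_title
        FROM fact_employee_job f
        JOIN dim_employee e ON e.employee_key = f.employee_key
        JOIN dim_job j ON j.job_key = f.job_key
        WHERE f.effective_from <= :d
          AND (f.effective_to IS NULL OR f.effective_to >= :d)
        ORDER BY e.full_name, j.job_title
    """, {"d": as_of}).fetchall()
```

Employee profile attributes live once in dim_employee however many jobs the person holds, and counting rows returned for a date gives a job-level headcount.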
Note that your vacancy reference would be part of a separate fact (when/where did you post it, how many applicants are all measurable facts about the vacancy). This may also be an example of a degenerate dimension.
I'm not fond of your monthly fact. I think that should just be some calculated measures built on fact absence and fact employeejob. When those events are put up against your dimensions, you can break them down by date, job type, manager, etc.

Fact table linked to Slowly Changing Dimension

I'm struggling to understand the best way to model a particular scenario for a data warehouse.
I have a Person dimension, and a Tenancy dimension. A person could be on 0, 1 or (rarely) multiple tenancies at any one time, and will often have a succession of tenancies over time. A tenancy could have one or more people associated with it. The people associated with a tenancy can change over time, and tenancies generally last for many years.
One option is to add tenancy reference, start and end dates to the Person Dimension as type 2 SCD columns. This would work well as long as I ignore the possibility of multiple concurrent tenancies for a person. However, I have other areas of the data warehouse where I am facing a similar design issue and ignoring multiple relationships is not a possibility.
Another option is to model the relationship as an accumulating snapshot fact table. I'm not sure how well this would work in practice though as I could only link it to one version of a Person and Tenancy (both of which will have type 2 SCD columns) and that would seem to make it impossible to produce current or historical reports that link people and tenancies together.
Are there any recommended ways of modelling this type of relationship?
Edit based on the patient answer and comments given by SQL.Injection
I've produced a basic model showing the model as described by SQL.Injection.
I've moved tenancy start/end dates to the 'junk' dimension (Dim.Tenancy) and added Person tenancy start/end dates to the fact table as I felt that was a more accurate way to describe the relationship.
However, now that I see it visually I don't think that this is fundamentally any different from the model that I started with, other than the fact table is a periodic snapshot rather than an accumulating snapshot. It certainly seems to suffer from the same flaw that whenever I update a type 2 slowly changing attribute in any of the dimensions it is not reflected in the fact.
In order to make this work to reflect current changes and also allow historical reporting it seems that I will have to add a row to the fact table every time a SCD2 change occurs on any of the dimensions. Then, in order to prevent over-counting by joining to multiple versions of the same entity I will also need to add new versions of the other related dimensions so that I have new keys to join on.
I need to think about this some more. I'm beginning to think that the database model is right and that it's my understanding of how the model will be used that is wrong.
In the meantime any comments or suggestions are welcome!
Your problem is similar to sales transactions with multiple items. The difference is that a transaction usually has multiple items, while your tenancy fact usually has a single person (the tenant).
Your hydra is born because you are trying to model the tenancy as a dimension, when you should be modeling it as a fact.
The reason I think you have a tenancy dimension is that somewhere you have a rent fact. To model the rent fact, consider using the same approach I stated above: if two persons are tenants of the same property, two fact records should be inserted each month:
1) And now comes some magic (that is no magic at all): split the value of the rent by the number of tenants and store it in the fact
2) also store the full value of the rent (you don't know how the data scientist is going to use the data)
3) check 1) with the business users (I mean the people who build the risk models); there might be some advanced rule on how to do the splitting (a similar thing happens when the cost of shipping is to be divided across multiple item lines of the same order -- it might not be uniformly distributed)
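The three steps above could be sketched roughly as follows. The function name, the row layout and the rule of giving the rounding remainder to the first tenant are assumptions for illustration; as step 3 says, the real allocation rule should come from the business users.

```python
from decimal import Decimal, ROUND_DOWN


def rent_fact_rows(tenancy_id, month, full_rent, tenant_ids):
    """Build one rent fact row per tenant, splitting the rent evenly.

    Hypothetical helper: an even split is only the default; any leftover
    cent from rounding down is added to the first tenant's share.
    """
    if not tenant_ids:
        raise ValueError("a tenancy needs at least one tenant")
    full_rent = Decimal(full_rent)
    count = len(tenant_ids)
    share = (full_rent / count).quantize(Decimal("0.01"), rounding=ROUND_DOWN)
    remainder = full_rent - share * count
    rows = []
    for position, person_id in enumerate(tenant_ids):
        allocated = share + remainder if position == 0 else share
        rows.append({
            "tenancy_id": tenancy_id,
            "person_id": person_id,
            "month": month,
            "allocated_rent": allocated,  # step 1: additive across tenants
            "full_rent": full_rent,       # step 2: repeated, not additive
        })
    return rows
```

Summing allocated_rent over the rows of one tenancy and month gives back the full rent, whereas full_rent must not be summed across tenants of the same tenancy.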

normalization 1NF or 3NF

Even after reading many articles online (and I know this question has been asked quite a number of times), I'm still having problems identifying whether a relation is in 1NF, 2NF or 3NF.
I've found an example as below:
Students are involved in many projects, and each project may have
many students working on it. The number of hours each student
works on a project, and the start date on which the student starts
working on the project, are saved in the following relational table.
StudProject (StudNum, ProjNum, HoursWork, DateStartWorkOnProj)
I've tried breaking it into the following dependencies on my own, and I'm not sure if I'm right:
StudNum, ProjNum --> HoursWork, DateStartWorkOnProj
StudNum --> ProjNum
ProjNum --> HoursWork, DateStartWorkOnProj
So it actually has a transitive dependency, so in this case should it be in 2NF? Or should it be 3NF, since HoursWork and DateStartWorkOnProj actually depend on StudNum and ProjNum?
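One way to sanity-check the three proposed dependencies is to test them against a few sample rows. The rows below are invented for illustration, and sample data can only refute a functional dependency, never prove it; the real dependencies come from the problem statement. Under the stated scenario (a student may work on many projects, a project may have many students), the check looks like this:

```python
def fd_holds(rows, lhs, rhs):
    """Return True if, in these rows, each lhs value maps to a single rhs value."""
    seen = {}
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        value = tuple(row[attr] for attr in rhs)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Invented sample rows: student 1 works on two projects,
# and project "A" has two students.
SAMPLE = [
    {"StudNum": 1, "ProjNum": "A", "HoursWork": 10, "DateStartWorkOnProj": "2024-01-08"},
    {"StudNum": 1, "ProjNum": "B", "HoursWork": 4, "DateStartWorkOnProj": "2024-02-01"},
    {"StudNum": 2, "ProjNum": "A", "HoursWork": 7, "DateStartWorkOnProj": "2024-01-15"},
]

NON_KEY = ["HoursWork", "DateStartWorkOnProj"]
key_fd = fd_holds(SAMPLE, ["StudNum", "ProjNum"], NON_KEY)  # True
stud_fd = fd_holds(SAMPLE, ["StudNum"], ["ProjNum"])        # False
proj_fd = fd_holds(SAMPLE, ["ProjNum"], NON_KEY)            # False
```

If only the first dependency holds, the non-key attributes depend on the whole composite key (StudNum, ProjNum) and on nothing else, so there is no partial or transitive dependency and the relation would be in 3NF.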
If this is all the data you have for each project, I think this table is fine:
StudProject (StudNum, ProjNum, HoursWork, DateStartWorkOnProj)
But if you want to store more information about projects and work time, the design must be extended:
StudProject (StudNum, ProjNum)
projectWork (StudNum, ProjNum, workTime, startDateTime, endDateTime)
In the projectWork table, each record represents one work day of a student, and the difference between start and end is saved in workTime. sum(workTime) for each student in a project gives that student's total work on it.
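As a small illustration of the sum(workTime) idea, here is a sketch using Python's sqlite3 module. The column types and the assumption that workTime is stored in hours are mine; the table and column names follow the answer.

```python
import sqlite3


def total_work(rows):
    """Load projectWork rows into memory and total workTime per student and project."""
    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE projectWork (
            StudNum       INTEGER NOT NULL,
            ProjNum       TEXT NOT NULL,
            workTime      REAL NOT NULL,  -- assumed to be hours
            startDateTime TEXT NOT NULL,
            endDateTime   TEXT NOT NULL
        )
    """)
    con.executemany("INSERT INTO projectWork VALUES (?, ?, ?, ?, ?)", rows)
    return con.execute("""
        SELECT StudNum, ProjNum, SUM(workTime)
        FROM projectWork
        GROUP BY StudNum, ProjNum
        ORDER BY StudNum, ProjNum
    """).fetchall()


# Invented sample: student 1 worked two days on project A, student 2 one day.
WORK_LOG = [
    (1, "A", 2.5, "2024-01-08 09:00", "2024-01-08 11:30"),
    (1, "A", 3.0, "2024-01-09 13:00", "2024-01-09 16:00"),
    (2, "A", 1.0, "2024-01-15 10:00", "2024-01-15 11:00"),
]
```

Calling total_work(WORK_LOG) returns one row per student and project with the summed hours.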

How does Master/Detail work? [closed]

Closed 9 years ago.
I have found some tutorials, but they still leave me with questions.
Let's take a classic example of 2 tables, one for customer details and one for order details.
The customers table in the database has:
an autoincrementing integer customer_id as primary key
a text field for customer name
a text field for contact details
And the orders table has:
an integer customer_id which is a foreign key referencing the customers table
some other stuff, such a reference to a bunch of item numbers
an integer order_value to store the cash value of the order
I need two dataset components, two queries and a connection.
So far, so good? Or did I miss something already?
Now, the tutorials say that I have to set the MasterSource of the datasource which corresponds to the DB grid showing the orders table to be the datasource which corresponds to the customers table, and the MasterFields, in this case, to customer_id.
Anything else? Should I, for instance, set the DetailFields of the query of the datasource which corresponds to the customers table to customer_id?
Should I use the properties, or a parameterized query?
Ok, at this point, we have followed the classic tutorials and can scroll through the customers DB grid and see all orders for the current customer shown in the orders DB grid. When the user clicks the customers DB grid I have to Close(); then Open(); the orders query to refresh its corresponding DB grid.
However, those tutorials always seem to posit a static database with existing contents which never change.
When I asked another question, I gave an example where I was using a Command to INSERT INTO orders... and was told that that is A Bad Thing and I should:
OrdersQuery.Append();
OrdersQuery.FieldByName('customer_id').Value := [some value];
OrdersQuery.FieldByName('item_numbers').Value := [some value];
OrdersQuery.FieldByName('order_value').Value := [some value];
OrdersQuery.Post();
Is that correct?
I ask because it seems to me that a Command puts data in and a query should only take it out, but I can see that a command doesn't have linkage to the DB grid via a datasource's query.
Is this a matter of choice, or must the query be used?
If so, it seems that I can't use even simple SQL functions such as SUM, MIN, AVG, MAX in the query and have to move those into my code.
If I must use the query, how do I implement SQL UPDATE and DROP?
And, finally, can I have a Master/Detail/Detail relationship?
Let's say I want a 3rd DB grid, which shows the total and average of all orders for a customer. It gets its data from the orders table (but can't use SUM and AVG), which is updated each time the user selects a different customer, thus giving a Master/Detail/Detail relationship. Do I just set that up as two Master/Detail relationships? I.e., the DB grid, datasource and query for the total and average orders refer only to orders and have no reference to customers, even if they do use customer_id?
Thanks in advance for any help and clarification. I hope that this question will become a reference for others in the future (so, feel free to edit it).
TLDR: In the SQL world, Master/Detail is an archaism.
When some people say "Master Detail" they aren't going to go all the way down the rabbit hole. Your question suggests you do want to. I'd like to share a few things that I think are helpful, but I don't see that anyone can really answer your questions completely.
A minimal implementation of master detail, for any two datasets, for some people's purposes, is nothing more than an event handler firing when the currently selected row in the master table changes. This row is then used to filter the rows in the detail table dataset, so that only the rows that match the primary key of the master row are visible. This is done for you, if you configure it properly, in most of the TTable-like objects in Delphi's VCL, but even Datasets that do not explicitly support master/detail configurations can be made to function this way, if you are willing to write a few event handlers, and filter data.
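The "event handler plus filter" idea is not specific to Delphi. As a language-neutral sketch (written in Python, with a made-up MasterDetail class and plain dictionaries standing in for datasets; this is not how the VCL implements it), it amounts to this:

```python
class MasterDetail:
    """Toy master/detail link over two in-memory row lists (illustrative only)."""

    def __init__(self, master_rows, detail_rows, key):
        self.master_rows = master_rows
        self.detail_rows = detail_rows
        self.key = key              # field shared by master and detail rows
        self.current = None         # currently selected master row
        self.visible_details = []   # detail rows shown for that master row

    def select_master(self, index):
        """Stand-in for the user moving to another row in the master grid."""
        self.current = self.master_rows[index]
        self._on_master_changed()

    def _on_master_changed(self):
        """The 'event handler': keep only details matching the master key."""
        wanted = self.current[self.key]
        self.visible_details = [
            row for row in self.detail_rows if row[self.key] == wanted
        ]


# Invented sample data using the field names from the question.
CUSTOMERS = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Bolt"},
]
ORDERS = [
    {"customer_id": 1, "order_value": 100},
    {"customer_id": 2, "order_value": 50},
    {"customer_id": 1, "order_value": 300},
]
```

Selecting a different master row re-runs the filter, which is all that the built-in master/detail properties automate in the simplest case.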
At one of my former employers, a person had invented a Master Detail controller component, which worked along with a little-known variant of ADO components for Delphi known as Kamiak, and it had some properties which people who are only familiar with the BDE-TTable era concept of master detail would not have expected. It was a very clever bit of work, and it had the following features:
You could create an ADO recordset and hold it in memory, and then as a batch, write a series of detail rows, all at once, if and only if the master row was to be stored to the disk.
You could nest these master-detail relationships to almost arbitrary depths, so you could have master, detail and sub-detail records. Batch updates were used for UPDATES, to answer that part of your question. To handle updates you need to either roll your own ORM or Recordset layer, or use a pre-built caching/recordset layer. There are many options, from ADO, to the various ORM-like components for Delphi, or even something involving client-datasets or a briefcase model with data pumps.
You could modify and post data into an in-memory staging area, and flush all the master and detail rows at once, or abandon them. This allowed a nearly object-relational level of persistence management.
As lovely as the roll-your-own-ORM approach seems above, it was not without its dark side. Strange bugs in the system led me to never want to use such an approach again. I do not wish to overstate things, but can I humbly suggest that there is such a thing as going too far down the master-detail rabbit-hole? Don't go there. Or if you do, realize that you're really building a mini ORM, and be prepared to do the work, which should include a pretty solid set of unit tests and integration tests. Even then, be aware that you might discover some pretty strange corner cases, and might find that a few really wicked bugs are lurking in your beautiful ORM/MasterDetail thing.
As far as inserts go, that of course depends on whether you are a builder or a user. A person who is content to build atop whatever Table classes are in the VCL and who never wants to dirty their hands with SQL is going to think your approach is wrong-headed if you are not afraid of SQL. I wonder how that person is going to deal with auto-assigned identity primary keys, though. I store a person record in a table, and immediately I need to fetch back that person's newly assigned ID, which is an integer. I am going to use that integer primary key to associate my detail rows with the master row, so the detail rows refer to the master row's ID as a foreign key, because my SQL database is nicely constructed, with referential integrity constraints. And because I've thought about all this in advance and don't want to do it over and over again, I eventually get from here to building an object-relational-mapping framework. I hope you can see how your many questions have many possible answers, answers which have led to hundreds or millions of possible approaches, and there is no one right one. I happen to be a disbeliever in ORMs, and I think the safe place to get off this crazy train is before you get on it. I hand code my SQL, and I hand code my business objects, and I don't use any fancy Master Detail or ORM stuff. You, however, may choose to do as you like.
What I would have implemented as "master detail" in the BDE/dBase/flat-file era, I now simply implement as a query for master rows and a second query for detail rows. When the master row changes, I refresh the detail rows query, and I do not use the MasterSource or related Master/Detail properties in the TTable objects at all.
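For readers who want to see the hand-coded approach in miniature, here is a sketch in Python with sqlite3 rather than Delphi. The table layout follows the customers/orders example from the question; the function names are made up. It shows fetching the auto-assigned master key right after the insert, a parameterized detail query that is simply re-run when the master row changes, and a third query that uses SUM and AVG directly.

```python
import sqlite3


def open_db():
    """Create the two tables from the question in an in-memory database."""
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")
    con.executescript("""
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY AUTOINCREMENT,
            name        TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY AUTOINCREMENT,
            customer_id INTEGER NOT NULL REFERENCES customers (customer_id),
            order_value INTEGER NOT NULL
        );
    """)
    return con


def add_customer_with_orders(con, name, order_values):
    """Insert a master row and its detail rows in one transaction."""
    with con:  # commits on success, rolls back if an exception is raised
        cur = con.execute("INSERT INTO customers (name) VALUES (?)", (name,))
        customer_id = cur.lastrowid  # the newly assigned primary key
        con.executemany(
            "INSERT INTO orders (customer_id, order_value) VALUES (?, ?)",
            [(customer_id, value) for value in order_values],
        )
    return customer_id


def orders_for(con, customer_id):
    """Detail query: re-run this whenever the selected customer changes."""
    rows = con.execute(
        "SELECT order_value FROM orders WHERE customer_id = ? ORDER BY order_id",
        (customer_id,),
    ).fetchall()
    return [value for (value,) in rows]


def order_summary(con, customer_id):
    """Second detail query: total and average, computed by the database."""
    return con.execute(
        "SELECT SUM(order_value), AVG(order_value) FROM orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
```

Both orders_for and order_summary depend only on the selected customer_id, so the "third grid" is just a second detail of the same master row.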

Performance issues with complex nested RoR reservation system

I'm designing a Ruby on Rails reservation system for our small tour agency. It needs to accommodate a number of things, and the table structure is becoming quite complex.
Has anyone encountered a similar problem before? What sort of issues might I come up against? And are performance/ validation likely to become issues?
In simple terms, I have a customer table, and a reservations table. When a customer contacts us with an enquiry, a reservation is set up, and related information added (e.g., paid/ invoiced, transport required, hotel required, etc).
So far so good, but this is where it gets complex. Under each reservation, a customer can book different packages (e.g. day trip, long tour, training course). These are sufficiently different, require specific information, and are limited in number, such that I feel they should each have a different model.
Also, a customer may have several people in his party. This would result in links between the customer table and the reservation table, as well as between the customer table and the package tables.
So, if customer A were to make a booking for a long trip for customers A,B and C, and a training course for customer B, it would look something like this.
CUSTOMERS TABLE
CustomerA
CustomerB
CustomerC
CustomerD
CustomerE
etc
RESERVATIONS TABLE
1. CustomerA
LONG TRIP BOOKINGS
CustomerA - Reservation_ID 1
CustomerB - Reservation_ID 1
CustomerC - Reservation_ID 1
TRAINING COURSE BOOKINGS
CustomerB - Reservation_ID 1
This is a very simplified example, and omits some detail. For example, there would be a model containing details of training courses, a model containing details of long trips, a model containing long trip schedules, etc. But this detail shouldn't affect my question.
What I'd like to know is:
1) are there any issues I should be aware of in linking the customer table to the reservations model, as well as to bookings models nested under reservations.
2) is this the best approach if I need to handle information about the reservation itself (including invoicing), as well as about the specific package bookings.
On the one hand this approach seems to be complex, but on the other, simplifying everything into a single package model does not appear to provide enough flexibility.
Please let me know if I haven't explained this issue very clearly, I'm happy to provide more information. Grateful for any ideas, suggestions or comments that would help me think through this rather complex database design.
Many thanks!
I have built a large reservation system for travel operators and wholesalers, and I can tell you that it isn't easy. There seems to be similarity yet still large differences in the kinds of product booked. Also, date-sensitivity is a large difference from other systems.
1) In respect to 'customers' I have typically used different models for representing different concepts. You really have:
a. Person / Company paying for the booking
b. Contact person for emergencies
c. People travelling
a & b seem like the same, but if you have an agent booking, then you might want to separate them.
I typically use a 'customer' table for a, some simple contact fields for b, and a 'passengers' table for c. These could be set up as different associations to the same model, but I think they are different enough that I tend to separate them, perhaps with a common address/contact model.
2) I think this is fine, but it depends on your needs. If you are building up itineraries for a traveller, then it makes sense to set up 'passengers' on the 'reservation', and then individual itinerary items with links to the passengers travelling on or using each item.
This is more complicated, and you must be careful to track dependencies, but the alternative is to not track passenger names and simply assign quantities to each item (1xAdult, 2xChildren). This latter method is great for small bookings, so it seems to depend on whether your bookings are simple or typically built up of longer itineraries.
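To illustrate the passengers-per-item layout, here is a rough schema sketch in plain SQL run through Python's sqlite3 module rather than as Rails models. The table names (passengers, itinerary_items, item_passengers) and the product_type column are assumptions made for this example; the sample data reproduces the booking from the question (a long trip for A, B and C, and a training course for B, all under one reservation made by customer A).

```python
import sqlite3


def build_reservation_db():
    """Create an in-memory reservation schema with the question's sample booking."""
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")
    con.executescript("""
        CREATE TABLE customers (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        -- The reservation belongs to the paying customer.
        CREATE TABLE reservations (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers (id)
        );
        -- People travelling are attached to the reservation.
        CREATE TABLE passengers (
            id             INTEGER PRIMARY KEY,
            reservation_id INTEGER NOT NULL REFERENCES reservations (id),
            name           TEXT NOT NULL
        );
        CREATE TABLE itinerary_items (
            id             INTEGER PRIMARY KEY,
            reservation_id INTEGER NOT NULL REFERENCES reservations (id),
            product_type   TEXT NOT NULL
        );
        -- Which passenger travels on or uses which item.
        CREATE TABLE item_passengers (
            item_id      INTEGER NOT NULL REFERENCES itinerary_items (id),
            passenger_id INTEGER NOT NULL REFERENCES passengers (id),
            PRIMARY KEY (item_id, passenger_id)
        );
        INSERT INTO customers VALUES (1, 'Customer A');
        INSERT INTO reservations VALUES (1, 1);
        INSERT INTO passengers VALUES
            (1, 1, 'Customer A'), (2, 1, 'Customer B'), (3, 1, 'Customer C');
        INSERT INTO itinerary_items VALUES
            (1, 1, 'long_trip'), (2, 1, 'training_course');
        INSERT INTO item_passengers VALUES (1, 1), (1, 2), (1, 3), (2, 2);
    """)
    return con


def passengers_per_item(con, reservation_id):
    """Count the passengers linked to each itinerary item of a reservation."""
    return con.execute("""
        SELECT i.product_type, COUNT(ip.passenger_id)
        FROM itinerary_items i
        LEFT JOIN item_passengers ip ON ip.item_id = i.id
        WHERE i.reservation_id = ?
        GROUP BY i.id
        ORDER BY i.id
    """, (reservation_id,)).fetchall()
```

The simpler quantity-based alternative would drop passengers and item_passengers and keep a count column on itinerary_items instead.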
other) In addition, in respect to different models for different product types, this can work well. However, there tends to be a lot of cross over, so some kind of common 'resource' model might be better -- or some other means of capturing common behaviour.
If I haven't answered your questions, please do ask more specific database design questions, or I can add more detail about specific examples of what I've found works well.
Good luck with the design!
