Question regarding role-playing dimension - data-warehouse

I hope you can help me with one question regarding role-playing dimensions.
When using views for a role-playing dimension, does it matter which view is referred to later in the analysis? In particular, when sorting on the role-playing dimension, can this be done no matter which view is used?
Hope the question is clear enough. If not, let me know and I will elaborate.
Thanks in advance.

Do you mean you have created a view similar to "SELECT * FROM DIM" for each role the Dim plays? If that's all you've done then you could use any of these views in a subsequent SQL statement that joins the DIM to a FACT table - but obviously if you use the "wrong" view it's going to be very confusing for anyone trying to read your SQL (or for you trying to understand what you've written in 3 months' time!)
For example, if you have a fact table with keys OrderDate and ShipDate that both reference your DateDim then you could create vwOrderDate and vwShipDate. You could then join FACT.OrderDate to vwShipDate and FACT.ShipDate to vwOrderDate and it will make no difference to the actual resultset your query produces (apart from, possibly, column names).
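For illustration, a minimal sketch of such views (DateKey and FullDate are made-up column names; a real DateDim will carry many more attributes):
CREATE VIEW vwOrderDate AS
SELECT DateKey  AS OrderDateKey,
       FullDate AS OrderDate
FROM   DateDim;
-- same pattern for the other role
CREATE VIEW vwShipDate AS
SELECT DateKey  AS ShipDateKey,
       FullDate AS ShipDate
FROM   DateDim;
Sorting then behaves the same through either view, because both read exactly the same underlying DateDim rows; only the aliases differ.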
However, unless the applicable attributes are very different for different roles, I really wouldn't bother creating views for role-playing Dims as it's an unnecessary overhead and just going to cause confusion to anyone you've given access to at this level of the DB (who presumably have pretty strong SQL skills to be given this level of access?).
If you are trying to make life easier for end users then either create these types of "views" in the models of the BI tool(s) they are using - and not directly in the DB - or, if they are being given access to the DB, create View(s) across the Fact(s) and all their joined Dimensions.

How to change model lazy loading at runtime in Symfony?

I use sfPropelORMPlugin.
Lazy loading is fine if I operate on one object per web page, but if there are hundreds I get hundreds of separate DB queries. I'd like to disable lazy loading completely, or disable it for the needed columns on those particularly heavy pages, but I haven't found a way so far.
You should join all your relations when you build your query; that way you'll get all the data in a single query. Note that you have to use joinWithRelation(), where Relation is the related table name.
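Purely as an illustration of the difference (the book/author tables and columns here are made up), lazy loading issues one query per related object, whereas the eager join pulls everything back in one statement:
-- Lazy loading: one extra query per object (the N+1 pattern)
SELECT * FROM author WHERE id = 42;
SELECT * FROM author WHERE id = 43;
-- ...and so on, once per book
-- Eager join (the kind of statement a joinWithAuthor() call builds): one query
SELECT book.*, author.*
FROM book
INNER JOIN author ON author.id = book.author_id;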
Elaborating on William Durand's answer, perhaps you should also look at the Propel function doSelectJoinAll(), which should pre-load all of the objects related to your relations. Just keep in mind this can be expensive in terms of memory.
Another technique is to create a custom criteria with your needed joins, then use a manual hydrate technique to add on to your base object. I do this often when the data I need is using aggregates or other columns that are not exactly mapped to objects. There are plenty of hydrate() examples around.
I added a utility method to the peer to be able to set which columns I want to load, using "pseudo columns" for this type of DB query. I also overrode hydrate() to understand this "markup". All was good until I found out that even though the data is hydrated, symfony won't understand it and won't let you use it as intended.
PS: a join was never considered an option because the site is under fairly high load.

DB-agnostic calculations: Is it good to store calculation results? If yes, what's the best way to do this?

I want to perform some simple calculations while staying database-agnostic in my rails app.
I have three models:
.---------------.        .--------------.            .---------------.
| ImpactSummary |<-------| ImpactReport |<-----------| ImpactAuction |
`---------------'1      *`--------------'1          *`---------------'
Basically:
ImpactAuction holds data about... auctions (prices, quantities and such).
ImpactReport holds monthly reports that have many auctions as well as other attributes; it also shows some calculation results based on the auctions.
ImpactSummary holds a collection of reports as well as some information about a specific year, and also shows calculation results based on the two other models.
What I intend to do is store the results of these really simple calculations (just means, sums, and the like) in the relevant tables, so that reading them is fast and so that I can easily run queries on the calculation results.
Is it good practice to store calculation results? I'm pretty sure it's not a very good thing, but is it acceptable?
Is it useful, or should I not bother and perform the calculations on the fly?
If it is good practice and useful, what's the best way to achieve what I want?
That's the tricky part. At first, I implemented a simple chain of callbacks that would update the calculation fields of the parent model upon save (that is, when an auction is created or updated, it marks some_attribute_will_change! on its report and saves it, which triggers the report's own callbacks, and so on).
This approach works well when creating or updating a single record, but if I want to work on several records, it triggers the calculations on the whole chain for each record... So I suddenly find myself forced to put a condition on the callbacks depending on whether I have one record or many, and I can't figure out how (using a class method that could be called on a relation? using an instance attribute #skip_calculations on each record? just using an outdated field to mark the parent records for later calculation?).
Any advice is welcome.
Bonus question: would it be considered DB-agnostic if I implement this with DB views?
As usual, it depends. If you can perform the calculations in the database, either using a view or using #find_by_sql, I would do so. You'll save yourself a lot of trouble: otherwise you have to keep your summaries up to date whenever the underlying values change, and you've already met that problem when updating multiple rows. Having a view, or a query that implements the view stored as text in ImpactReport, will allow you to always have fresh data.
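As a sketch of the "calculate it in the database" option (the impact_auctions column names are assumptions; adjust them to the real schema), the aggregates can live in a view that is always fresh:
CREATE VIEW impact_report_summaries AS
SELECT impact_report_id,
       COUNT(*)      AS auctions_count,
       SUM(quantity) AS total_quantity,
       AVG(price)    AS mean_price
FROM   impact_auctions
GROUP  BY impact_report_id;
An ImpactReport can then read its figures through that view (or via #find_by_sql) instead of caching them in its own columns.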
The answer? Benchmark, benchmark, benchmark ;)

entity framework ref, deref, createref [duplicate]

This question is about why I would use the above keywords. I've found plenty of MSDN pages that explain how. I'm looking for the why.
What query would I be trying to write that means I need them? I ask because the examples I have found appear to be achievable in other ways...
To try and figure it out myself, I created a very simple entity model using the Employee and EmployeePayHistory tables from the AdventureWorks database.
One example I saw online demonstrated something similar to the following Entity SQL:
SELECT VALUE
DEREF(CREATEREF(AdventureWorksEntities3.Employee, row(h.EmployeeID))).HireDate
FROM
AdventureWorksEntities3.EmployeePayHistory as h
This seems to pull back the HireDate without having to specify a join?
Why is this better than the SQL below (that appears to do exactly the same thing)?
SELECT VALUE
h.Employee.HireDate
FROM
AdventureWorksEntities3.EmployeePayHistory as h
Looking at the above two statements, I can't work out what extra the CREATEREF, DEREF bit is adding since I appear to be able to get at what I want without them.
I'm assuming I have just not found the scenarios that demonstrate the purpose. I'm assuming there are scenarios where using these keywords is either simpler or the only way to accomplish the required result.
What I can't find is the scenarios....
Can anyone fill in the gap? I don't need entire sets of SQL. I just need a starting point to play with i.e. a brief description of a scenario or two... I can expand on that myself.
Look at this post
One of the benefits of references is that a reference can be thought of as a 'lightweight' entity for which we don't need to spend resources creating and maintaining the full entity state/values until it is really necessary. Once you have a ref to an entity, you can dereference it by using the DEREF expression or by just invoking a property of the entity.
TL;DR - REF/DEREF are similar to C++ pointers. They are references to persisted entities (not entities that have not yet been saved to a data source).
Why would you use such a thing? A reference to an entity uses less memory than having the DEREF'ed (or expanded, filled, instantiated) entity. This may come in handy if you have a bunch of records that hold image information and image data (4 GB files stored in the database). If you didn't use a REF, and you pulled back 10 of these entities just to get the image metadata, you'd quickly fill up your memory.
I know, I know. It'd be easier just to pull back the metadata in your query, but then you lose the point of what REF is good for :-D
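As a sketch of that idea against the same model as the question (EmployeePayHistory obviously has no huge image columns; only the shape matters), you can carry just a reference per row and dereference only when the full entity is really needed:
-- Each row carries a lightweight reference; no Employee columns are materialized yet
SELECT CREATEREF(AdventureWorksEntities3.Employee, row(h.EmployeeID)) AS EmployeeRef
FROM AdventureWorksEntities3.EmployeePayHistory AS h
DEREF (or a property access) is then applied only to the few references whose full data you actually want.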

Would like to Understand 6NF with an Example

I have just read #PerformanceDBA's arguments re: 6NF and E-A-V. I am intrigued. I had previously been skeptical of 6NF as it was presented as "merely" sticking some timestamp columns on tables.
I have always worked with a data dictionary and do not need to be convinced to use one, or to generate SQL code. So I expect an answer that would require a dictionary (or catalog) that is used to generate code.
So I would like to know how 6NF would deal with an extremely simple example. A table of items, descriptions and prices. The prices change over time.
So anyway, what does the Items table look like when converted to 6NF? What is the "explosion of tables" that happens here?
If the example does not work with a table this simple, feel free to add what is necessary to get the point across.
I actually started putting an answer together, but I ran into complications, because you (quite understandably) want a simple example. The problem is manifold.
First, I don't have a good idea of your level of actual expertise re Relational Databases and 5NF; I don't have a starting point from which to take up and then discuss the specifics of 6NF.
Second, just like any of the other NFs, it is variegated. You can just barely step into it; you can implement 6NF for certain tables; you can go the whole hog on every table; etc. Sure there is an explosion of tables, but then you Normalise that, and kill the explosion; that's an advanced or mature implementation of 6NF. No use providing the full or partial levels of 6NF when you are asking for the simplest, most straightforward example.
I trust you understand that some tables can be "in 5NF" while others are "in 6NF".
So I put one together for you. But even that needs explanation.
Now, SQL barely supports 5NF; it does not support 6NF at all (I think dportas says the same thing in different words). Now I implement 6NF at a deep level, for performance reasons, simplified pivoting (of entire tables; any and all columns, not the silly PIVOT function in MS), columnar access, etc. For that you need a full catalogue, which is an extension to the SQL catalogue, to support the 6NF that SQL does not support, and maintain data Integrity and business Rules. So, you really do not want to implement 6NF for fun; you only do that if you have a need, because you have to implement a catalogue. (This is what the EAV crowd do not do, and this is why most EAV systems have data integrity problems. Most of them do not use the declarative Referential & Data Integrity that SQL does have.)
But most people who implement 6NF don't implement the deeper level, with a full catalogue. They have simpler needs, and thus implement a shallower level of 6NF. So, let's take that, to provide a simple example for you. Let's start with an ordinary Product table that is declared to be in 5NF (and let's not argue about what 5NF is). The company sells various different kinds of Products, half the columns are mandatory, and the other half are optional, meaning that, depending on the Product Type, certain columns may be Null. While they may have done a good job with the database, the Nulls are now a big problem: columns that should be Not Null for certain ProductTypes are Null, because the declaration states NULL, and their app code is only as good as the next guy's.
So they decide to go with 6NF to fix that problem, because the subtitle of 6NF states that it eliminates The Null Problem. Sixth Normal Form is the irreducible Normal Form: there will be no further NFs after this, because the data cannot be Normalised further. The rows have been Normalised to the utmost degree. The definition of 6NF is:
a table is in 6NF when the row contains the Primary Key and at most one attribute.
Notice that by that definition, millions of tables across the planet are already in 6NF, without having had that intent. Eg. typical Reference or Look-up tables, with just a PK and Description.
Right. Well, our friends look at their Product table, which has eight non-key attributes, so if they make the Product table 6NF, they will have eight sub-Product tables. Then there is the issue that some columns are Foreign Keys to other tables, and that leads to more complications. And they note the fact that SQL does not support what they are doing, and they have to build a small catalogue. Eight tables are correct, but not sensible. Their purpose was to get rid of Nulls, not to write a little subsystem around each table.
Simple 6NF Example
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find IDEF1X Notation useful in order to interpret the symbols in the example.
So typically, the Product Table retains all the Mandatory columns, especially the FKs, and each Optional column, each Nullable column, is placed in a separate sub-Product table. That is the simplest form I have seen. Five tables instead of eight. In the Model, the four sub-Product tables are "in 6NF"; the main Product table is "in 5NF".
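A cut-down sketch of that layout, showing the main table and two of the four sub-Product tables (every column name here is made up for illustration; the real model keeps all the Mandatory columns, especially the FKs, in Product):
CREATE TABLE Product (
    ProductCode     char(10)    NOT NULL PRIMARY KEY,
    ProductTypeCode char(2)     NOT NULL REFERENCES ProductType (ProductTypeCode),
    Name            varchar(60) NOT NULL
    );
-- one table per Optional (previously Nullable) column
CREATE TABLE ProductWeight (
    ProductCode char(10)     NOT NULL PRIMARY KEY REFERENCES Product (ProductCode),
    Weight      decimal(9,3) NOT NULL
    );
CREATE TABLE ProductColour (
    ProductCode char(10)    NOT NULL PRIMARY KEY REFERENCES Product (ProductCode),
    Colour      varchar(20) NOT NULL
    );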
Now we really do not need every code segment that SELECTs from Product to have to figure out what columns it should construct, based on the ProductType, etc, so we supply a View, which essentially provides the 5NF "view" of the Product table cluster.
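Continuing the hypothetical sketch above, that 5NF "view" is just the sub-Product tables outer-joined back onto Product:
CREATE VIEW Product_5NF AS
SELECT p.ProductCode,
       p.ProductTypeCode,
       p.Name,
       w.Weight,   -- NULL where the Product has no Weight row
       c.Colour    -- NULL where the Product has no Colour row
FROM Product p
LEFT JOIN ProductWeight w ON w.ProductCode = p.ProductCode
LEFT JOIN ProductColour c ON c.ProductCode = p.ProductCode;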
The next thing we need is the basic rudiments of an extension to the SQL catalog, so that we can ensure that the rules (data integrity) for the various ProductTypes are maintained in one place, in the database, and not dependent on app code. The simplest catalogue you can get away with. That is driven off ProductType, so ProductType now forms part of that Metadata. You can implement that simple structure without a catalogue, but I would not recommend it.
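Just to give a feel for the shape (this is a guess, not the author's catalogue, which does much more), the minimal idea is metadata keyed on ProductType recording which optional attributes that type must carry:
-- hypothetical sketch only: the per-ProductType rules stored as data
CREATE TABLE ProductTypeAttribute (
    ProductTypeCode char(2)     NOT NULL REFERENCES ProductType (ProductTypeCode),
    AttributeName   varchar(30) NOT NULL,  -- e.g. 'Weight', 'Colour'
    PRIMARY KEY (ProductTypeCode, AttributeName)
    );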
Update
It is important to note that I implement all Business Rules in the database. Otherwise it is not a database (the notion of implementing rules "in application code" is hilarious in the extreme, especially nowadays, when we have florists working as "developers"). Therefore all rules, etc are first and foremost implemented as SQL declarations, CHECK constraints, functions, etc. That preserves all Declarative Referential Integrity, and declarative Data Integrity. The extension to the SQL catalog covers the area that SQL does not have declarations for, and they are then implemented as SQL. Being a good data dictionary, it does much more. Eg. I do not write Views every time I change the tables or add or change columns or their characteristics, they are created directly from the catalog+extension using a simple code generator.
One more very important note. You cannot implement 6NF (or EAV properly, for that matter), without completing a full and faithful Normalisation exercise, to 5NF. The problem I see at every site is, they don't have a genuine 5NF state, they have a mish-mash of partial normalisation or no normalisation at all, but they are very attached to that. Creating either 6NF or EAV from that is a disaster. Creating EAV or 6NF from that without all business rules implemented in declarative SQL is a nuclear disaster, burning for years. You get what you pay for.
End update.
Finally, yes, there are at least four further levels of Normalisation (Normalisation is a Principle, not a mere reference to a Normal Form) that can be applied to that simple 6NF Product cluster, providing more control, fewer tables, etc. The deeper we go, the more extensive the catalogue. And higher levels of performance. When you are ready, just ask; I have already erected the models and posted details in other answers.
In a nutshell, 6NF means that every relation consists of a candidate key plus no more than one other (key or non-key) attribute. To take up your example, if an "item" is identified by a ProductCode and the other attributes are Description and Price then a 6NF schema would consist of two relations (* denotes the key in each):
ItemDesc {ProductCode*, Description}
ItemPrice {ProductCode*, Price}
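In SQL terms (the types are illustrative), that decomposition is simply:
CREATE TABLE ItemDesc (
    ProductCode char(10)     NOT NULL PRIMARY KEY,
    Description varchar(100) NOT NULL
    );
CREATE TABLE ItemPrice (
    ProductCode char(10)      NOT NULL PRIMARY KEY,
    Price       decimal(10,2) NOT NULL
    );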
This is potentially a very flexible approach because it minimises the dependencies. That's also its main disadvantage however, especially in a SQL database. SQL makes it hard or impossible to enforce many multi-table constraints. Using the above schema, in most cases it will not be possible to enforce a business rule that every product must always have a description AND a price. Similarly, you may not be able to enforce some compound keys that ought to apply (because their attributes could be split over multiple tables).
So in considering 6NF you have to weigh up which dependencies and integrity rules are important to you. In many cases you may find it more practical and useful to stick to 5NF and normalize no further than that.
I had previously been skeptical of 6NF as it was presented as "merely" sticking some timestamp columns on tables.
I'm not quite sure where this apparent misconception comes from. Perhaps the fact that 6NF was introduced for the book "Temporal Data and the Relational Model" by Date, Darwen and Lorentzos? Anyhow, I hope the other answers here have clarified that 6NF is not limited to temporal databases.
The point I wanted to make is, although 6NF is "academically respectable" and always achievable, it may not necessarily lead to the optimal design in every case (and not just when considering implementation using SQL either). Even the aforementioned discoverers and proponents of 6NF seem to agree e.g.
Chris Date: "For practical purposes, stick to 5NF (and 6NF)."
Hugh Darwen: "the 6NF decomposition around Date [not the person!] would be overkill... an optimal design for the soccer club is... 5-and-a-bit-NF!"
Hugh Darwen: "we are in 5NF but not in 6NF, and again 5NF is sufficient" (several similar examples).
Then again, I can also find evidence to the contrary:
Chris Date: "Darwen and I have both felt for some time that all base relvars should be in 6NF".
On a practical note, I recently extended the SQL schema of one of our products to add a minor feature. I adopted a 6NF design to avoid nullable columns and ended up with six new tables where most (all?) of my colleagues would have used one table (or perhaps extended an existing table) with nullable columns. Despite me providing several 'helper' stored procs and a 'denormalized' VIEW with INSTEAD OF triggers, every coder that has had to work with this feature at the SQL level has gone out of their way to curse me :)
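For a rough idea of what such a "denormalized" VIEW plus INSTEAD OF trigger looks like (reusing the hypothetical Product_5NF sketch from the earlier answer; the author's actual feature and tables are not shown), the trigger splits one logical row back into the 6NF tables:
CREATE TRIGGER trg_Product_5NF_Insert
ON Product_5NF
INSTEAD OF INSERT
AS
BEGIN
    -- mandatory columns go to the main table
    INSERT INTO Product (ProductCode, ProductTypeCode, Name)
    SELECT ProductCode, ProductTypeCode, Name FROM inserted;
    -- optional columns get a sub-row only where a value was supplied
    INSERT INTO ProductWeight (ProductCode, Weight)
    SELECT ProductCode, Weight FROM inserted WHERE Weight IS NOT NULL;
END;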
These guys have it down: Anchor Modeling. Great academic papers on the subject, combined with practical examples. Their writings have finally pushed me over the edge to consider building a DW in 6NF on an upcoming project. The POC work I have done has validated (for me, at least) that the enormous benefits of 6NF outweigh the costs.

Can someone help me understand why an auto-identity (int) is bad when using NHibernate?

I've been seeing a lot of commentary (from an NHibernate perspective) about using Guid as opposed to an int (and presumably auto-id in the database), with the conclusion that using auto-identity breaks the UoW pattern.
This post has a short description of the issue, but it doesn't really tell me "why" it breaks the pattern (unless I'm misunderstanding, which is likely the case).
Can someone enlighten me?
There are a few major reasons.
Using a Guid gives you the ability to identify a single entity across many databases, including six relational databases with the same schema but different data, a document database, etc. This becomes important any time you have more than one single place where data goes - and that means your case too: you have a dev database and a prod database, right?
Using a Guid gives NHibernate the ability to batch more statements together, perform more database work at the very end of the unit of work / transaction, and reduce the total number of roundtrips to the database, increasing performance as well as conferring other benefits.
Comment:
Random Guids do not create poor indexes per se - they create poor clustered indexes. There are two solutions.
Use a partially sequential Guid. With NHibernate, this means using the guid.comb id generator rather than the guid id generator. guid.comb is partially sequential for good performance, but retains a very high degree of randomness.
Have your Guid primary key be a nonclustered index, and put a clustered index on another auto-incrementing column. You may decide to map this column, in which case you lose the benefit of better batching and fewer roundtrips, but you regain all the benefits of short numbers that fit easily in a URL. Or you may decide not to map this column and have it remain completely within the database, in which case you gain better performance for Guids as primary keys as well as better performance for NHibernate doing fewer roundtrips.
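A hedged T-SQL sketch of that second option (table and column names are invented): the Guid stays the primary key but is nonclustered, while a surrogate identity column that NHibernate need never map carries the clustered index:
CREATE TABLE Customer (
    CustomerId uniqueidentifier NOT NULL
        CONSTRAINT PK_Customer PRIMARY KEY NONCLUSTERED,
    ClusterKey int IDENTITY(1,1) NOT NULL,
    Name       nvarchar(100)    NOT NULL
    );
-- the narrow, ever-increasing column gets the clustered index instead
CREATE UNIQUE CLUSTERED INDEX IX_Customer_ClusterKey ON Customer (ClusterKey);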
My take would be that the key breaking factor is that getting the auto-incremented value requires an actual write to the database, which NHibernate would have deferred or possibly never performed.
When using identity in a parent-child scenario, NHibernate has to round-trip to the database to get the ID of the parent so that it can associate the child correctly. This means the parent has to be committed at that point. Should there be a problem with the child, you would then need to delete the parent in order to exit the UoW correctly.
