What would be the overhead of using GUID's instead of an integer identity in nHibernate for table primary keys?
The main reason for use would be the obfuscation of table ids on user facing data.
Impact on the day to day level for operations like GetById, Find first 50 filtered records, Delete, Update... zero. There is no measurable difference at all.
What will be effected a lot is the Database - storage. Integer is stored in 4 bytes, Guid consumes 16. So on every million rows you will need 12 MB more of storage. In case that such an entity references other(s) it is another 12 MB per every referenced table.
And when such columns are used for indexes... spaces could grow more and more.
So on a larger scale, SQL server will have to scan larger piece of data.
For example MS SQL Server is really optimized for int types. (including bigint, shortint, tinyint).
BUT, to be honest, we were using both approaches, and performance issue was always elsewhere...
I would vote for primary keys based on int every time - if possible.
Rather later change the MVC routing from controller\action\id to something else, but keep ids int. You can introduce GUID as a property (consuming some space but only ones, not in referenced tables) and change routing to controller\actin\uniqueGuid and on your business layer you will call GetByGuid if really needed to obfuscate users.
Related
I want to understand how surrogate keys are leveraged in real-time DWH environments. I get that they add the benefit of not being dependent on source-generated data to store each dimension key and also avoid having composite key built out of natural keys from dimensions in the fact, For eg, (prod id + cust id+ time id)
But does it not add the complexity of having to maintain the lookup of (natural key, surrogate key) while we load data into facts. I have been working in BI/DW teams for last 3 years and we do not maintain any surrogate keys in our systems. We leverage natural keys to build our datamarts. One sample usecase is revenue data which is stored in transactional system, which is loaded into warehouse at customer, product, time period granularity using the same natural keys from source. We use the same to join with corresponding dimensions to build STAR schema.
Main reason I think it makes sense in our case is that business uses EDW data to do micro-analysis of data at account level, not just trending analysis. We would need to maintain data integrity in that case which we achieve using natural keys. I want to understand how other DW environments work. How do you leverage surrogate keys or natural keys in your systems.
Thanks!
One reason is to maintain and being able to compare historical changes.
Example, if one of your product attributes changes and you wanted to look at and compare revenue before and after the attribute change, how would you do that without using surrogate product keys? Using a natural key would just overwrite the old value when you ETL.
The lookup doesn't have to be very complex to maintain. Most ETL tools have support for this and usually have some caching mechanism built in to cache lookup values.
Also, what do you mean when you say "real-time" data warehouse? Are you using ROLAP, DirectQuery or something similar? If so, you might be building your marts directly on your OLTP system and de-normalize in some semantic model. Then you could use your natural keys because there is no traditional ETL/data warehouse to do lookups and store your surrogate keys.
Lastly, granularity is not related to what type of key you are using.
If your business is stable and runs on top of a single application for everything, natural keys will work just fine, as your experience tells you.
Most businesses are not in such a state or not for very long. Mergers happen, new applications are introduced, legacy stuff refuses to die. New lines of business are started or split off and require wholesale renaming of existing natural key schemes.
Surrogate keys provide great benefits in keeping reporting dimensions stable and usable across the business when you have a bunch of separate new and legacy applications that all have their own versions of your customers and products and regularly get migrated or swapped out for similar systems with new natural key definitions. The major work is linking the various natural keys of a customer/product/whatever, assigning a surrogate key is just a simple and very helpful step in that.
Even in your scenario, I would use surrogate keys as they prepare you for future changes and are very helpful with historical data (as NITHIN B also answered) in Type 2 Dimensions.
It's quite possible to do versioning with natural keys by adding a version field to your dimension and fact tables, but it makes the joins harder to write for reporting and your whole system still gets messy if business or application changes cause the natural keys to change.
To illustrate:
Select bla
from Fact F
inner join Dim_Customer DC
on F.Surrogate_key = DC.Surrogate_Key
is almost foolproof. If you mess this up, it will be immediately obvious in your report.
Select bla
from Fact F
inner join Dim_Customer DC
on F.Natural_key = DC.Natural_Key
and F.Version = DC.Version
does the same job, but if you forget that last line, everything will look normal but your numbers will be inflated depending on how many versions there are on average. Kinda painful when that 25% sales increase turns out to be an error.
An additional reason, which has not been mentioned yet, is performance. Sometimes (very often in my experience) natural keys are strings, sometimes long strings.
It seems not a big deal using 10, 20 or 30 byte string instead of a 4 byte integer, but when you have 10 dimension and hundred of millions of rows, it adds up fast.
Could you please post a sample design.
I would be interested to see how you can load a fact table with Dimension Keys which are natural keys. Kimball design never recommends it.
My stand on Surrogate Keys in DWH.
Surrogate keys give you a lot of flexibility with Type 2 Dimensions,
ie if you have Type 2 Dimensions. For eg: You can track changes of a customer
if he or she changes her second name. You can have rows withe old values and
new values.
Fact tables usually hold keys which are surrogate keys. It makes your star
schema neat and tidy and robust.
However I am not jumping queues here, would wait for your design before going pro or against your stand.
Cheers
Nithin
I'm trying to create a datamart for the healthcare application. The facts in the datamart are basically going to be measurements and findings related to heart, and we have 100s of them. Starting from 1000 and can go to as big as 20000 per exam type.
I'm wondering what my design choices for the fact tables are:
Grain: 1 row per patient per exam type.
Some of the choices that I can think of -
1) A big wide fact table with 1000 or more columns.
2) EAV based design - A separate Measure dimension table. This foreign key will go into the fact table and the measure value will be in fact table. So the grain of the fact table will be changed to 1 row per patient per exam type per measurement.
3) Create smaller multiple fact tables per exam type per some other criteria like subgroup. But the end user is going to query across subgroups for that exam type and fact-fact join is not recommended.
4) any other ideas?
Any inputs would be appreciated.
1. A big wide fact table with 1000 or more columns.
One very wide fact table gives end-user maximum flexibility if queries are executed directly in the data warehouse. However some considerations should be taken into account, as you might hit some limits depending on a platform.
SQL Server 2014 limits are as per below:
Bytes per row 8,060. A row-overflow storage might be a solution, however it supports only few column types typically not related to fact nature, i.e. varchar, nvarchar, varbinary, sql_variant. Also not supported in In-Memory OLTP. https://technet.microsoft.com/en-us/library/ms186981(v=sql.105).aspx
Columns per non-wide table 1024. Wide-tables and sparse columns are solution as columns per wide table limit is 30,000. However the same Bytes per row limit applies. https://technet.microsoft.com/en-us/library/cc280604(v=sql.120).aspx
Columns per SELECT/INSERT/UPDATE statement 4,096
Non-clustered indexes per table 999
https://technet.microsoft.com/en-us/library/ms143432(v=sql.120).aspx
2. EAV based design - A separate Measure dimension table. This foreign key will go into the fact table and the measure value will be in fact table. So the grain of the fact table will be changed to 1 row per patient per exam type per measurement.
According to Kimball, EAV design is called Fact Normalization. It may make sense when a number of measurements is extremely lengthy, but sparsely populated for a given fact and no computations are made between facts.
Because facts are normalized therefore:
Extensibility is very easy, i.e. it's easy to add new measurements without the need to amend the data structure.
It's good to extract all measurements for one exam and present measurements as rows on the screen.
It's hard to extract/aggregate/make computation between several measurements (e.g. average HDL to CHOL ration) and present measurements/aggregates/computations as columns, i.e. requires complex WHERE/PIVOTING or multi-joins. SQL makes it difficult to make computations between facts in different rows.
If primary end-user platform is an OLAP cube then Fact Normalization makes sense. The cubes allows to make computation across any dimension.
Data importing could be an issue if data format is in a flat style CSV.
This questions is also discussed here Should I use EAV model?.
3) Create smaller multiple fact tables per exam type per some other criteria like subgroup. But the end user is going to query across subgroups for that exam type and fact-fact join is not recommended.
In some scenarios multiple smaller fact tables perfectly makes sense. One of the reason is if you hit some physical limits set by platform, e.g. Bytes per row.
The facts could be grouped either by subject area, e.g. measurement group/subgroup, or by frequency of usage. Each table could be placed on a separate file group and drive to maximize I/O.
Further, you could duplicate measurements across different fact tables to reduce the need of fact tables join, i.e. put one measurement in a specific measurement subgroup fact table and in frequently used measurement fact table.
However some considerations should be taken into account if there are some specific requirements for data loading. For example, if a record errors out in your ETL to one fact table, you might want to make sure that the corresponding records in the other fact tables are deleted and staged to your error table so you don't end up with any bogus information. This is especially true if end users have their own calculations in the front end tool.
If you use OLAP cubes then multiple fact tables actually becomes a source of a measure group to a specific fact table.
In terms of fact-to-fact join, you (BI application) should never issue SQL that joins two fact tables together across the fact table’s foreign keys. Instead, the technique of Drilling Across two fact tables should be used, where the answer sets from two or more fact tables are separately created, and the results sort-merged on the common row header attribute values to produce the correct result.
More on this topic: http://www.kimballgroup.com/2003/04/the-soul-of-the-data-warehouse-part-two-drilling-across/
4) any other ideas?
SQL XML or some kind NoSQL could be an option, but the same querying / aggregation / computation / presentation issues exist.
Please help me out with this big data problem.
I have a very large table (500G) that stores cookie information collected from one website, and I try to provide service to many other clients. For each client, they have their cookies, so in the end I need to do query on 500G+300G(client_data).
Since some query use both my cookie data and client cookie data, it is possible that I need to do a join between my table and their table, therefore the performance is bad. To solve this problem, I put the entire 800GB data into a giant table. Since there is no join table, the performance is good. But when I expand my service to multiple client, it takes too much storage.
Current I am using Vertica as my data source, and use bitmap to store my information.
Any suggestion that can maintain my current performance but also support like 40 cients? My storage is about 12 TB and each client in current solution talkes 1.5T.
what I want is either a replacement of Vertica with can support bitmap operation and quick table join. Or a better way to represent my data.
My storage is about 12 TB and each client in current solution talkes 1.5T.
If you have 40 * 1.5TB worth of non-duplicated cookie data to store, there's no magic to make that fit into 12TB.
This will be an imprecise answer due to the lack of details about definitions, etc. But I would add the following about performance:
Look at your projection definitions. You may be able to get performance gains depending on what you put in the order by clause of the projection.
You have a few ways forward, depending on the specifics of your case. Point 1 and 3 are the easiest to deal with:
You can properly set projections, to make sure that both tables are identically segmented: https://my.vertica.com/docs/6.1.x/HTML/index.htm#12549.htm
You can set up pre join projections, where the join cost is paid during data load, not during data retrieval, see https://my.vertica.com/docs/6.1.x/HTML/index.htm#1299.htm
Make sure that your data type is the best possible. Matching on ints is faster than matching on strings, matching columns with low cardinality is faster than matching columns with high cardinality.
If 1 and 3 are well set, Vertica can actually apply filters before decompression, fastening a lot your query and thus using a lot less memory.
There is a debate between our ETL team and a Data Modeler on whether a table should be normalized or not, and I was hoping to get some perspective from the online community.
Currently the tables are set up as such
MainTable LookupTable
PrimaryKey (PK) Code (PK)
Code (FK) Name
OtherColumns
Both tables are only being populated by a periodic file (from a 3rd party)
through an ETL job
A single record in the file contains all attributes in both tables for a single row)
The file populating these tables is a delta (only rows with some change in them are in the file)
One change to one attribute for one record (again only by the 3rd party) will result in all the data for that record in the file
The Domain Values for Code and Name are
not known.
Question:Should the LookupTable be denormalized into MainTable.
ETL team: Yes. With this setup, every row from the file will first have to check the 2nd table to see if their FK is in there (insert if it is not), then add the MainTable row. More Code, Worse Performance, and yes slightly more space. However ,regardless of a change to a LookupTable.Name from a 3rd party, the periodic file will reflect every row affected, and we will still have to parse through each row. If lumped into MainTable, all it is, is a simple update or insert.
Data Modeler: This is standard good database design.
Any thoughts?
Build prototypes. Make measurements.
You started with this, which your data modeler says is a standard good database design.
MainTable LookupTable
PrimaryKey (PK) Code (PK)
Code (FK) Name
OtherColumns
He's right. But this, too, is a good database design.
MainTable
PrimaryKey (PK)
Name
OtherColumns
If all updates to these tables come only from the ETL job, you don't need to be terribly concerned about enforcing data integrity through foreign keys. The ETL job would add new names to the lookup table anyway, regardless of what their values happen to be. Data integrity depends mainly on the system the data is extracted from. (And the quality of the ETL job.)
With this setup, every row from the file will first have to check the
2nd table to see if their FK is in there (insert if it is not), then
add the MainTable row.
If they're doing row-by-row processing, hire new ETL guys. Seriously.
More Code, Worse Performance, and yes slightly more space.
They'll need a little more code to update two tables instead of one. How long does it take to write the SQL statements? How long to run them? (How long each way?)
Worse performance? Maybe. Maybe not. If you use a fixed-width code, like an integer or char(3), updates to the codes won't affect the width of the row. And since the codes are shorter than the names, more rows might fit in a page. (It doesn't make any sense to use a code that longer than the name.) More rows per page usually means less I/O.
Less space, surely. Because you're storing a short code instead of a long name in every row of "MainTable".
For example, the average length of a country name is about 11.4 characters. If you used 3-character ISO country codes, you'd save an average of 8.4 bytes per row in "MainTable". For 100 million rows, you save about 840 million bytes. The size of that lookup table is negligible, about 6k.
And you don't usually need a join to get the full name; country codes are intended to be human-readable without expansion.
I've been seeing a lot of commentary (from an NHibernate perspective) about using Guid as opposed to an int (and presumably auto-id in the database), with the conclusion that using auto-identity breaks the UoW pattern.
This post has a short description of the issue, but it doesn't really tell me "why" it breaks the pattern (unless I'm misunderstanding which is likely the case.
Can someone enlighten me?
There are a few major reasons.
Using a Guid gives you the ability to identify a single entity across many databases, including six relational databases with the same schema but different data, a document database, etc. This becomes important any time you have more than one single place where data goes - and that means your case too: you have a dev database and a prod database, right?
Using a Guid gives NHibernate the ability to batch more statements together, perform more database work at the very end of the unit of work / transaction, and reduce the total number of roundtrips to the database, increasing performance as well as conferring other benefits.
Comment:
Random Guids do not create poor indexes - natively, they create poor clustered indexes. There are two solutions.
Use a partially sequential Guid. With NHibernate, this means using the guid.comb id generator rather than the guid id generator. guid.comb is partially sequential for good performance, but retains a very high degree of randomness.
Have your Guid primary key be a nonclustered index, and put a clustered index on another auto-incrementing column. You may decide to map this column, in which case you lose the benefit of better batching and fewer roundtrips, but you regain all the benefits of short numbers that fit easily in a URL. Or you may decide not to map this column and have it remain completely within the database, in which case you gain better performance for Guids as primary keys as well as better performance for NHibernate doing fewer roundtrips.
My take would be that the key breaking factor is that getting the auto-incremented value requires an actual write to the database, which nHibernate would have deferred or possibly never performed.
Using identity and in a parent-child scenario the database has round trip the database to get the ID of a parent so that it can associate the child correctly. This means that the parent has to be committed at this time. Should there be a problem with the child you would then need to delete the parent in order to exit the UoW correctly.