I have read many articles about why we should not have business logic scattered in several places, but should keep it in the business logic layer (BLL) instead. I understand the point: easier maintenance and a clearer understanding of what the code does.
However, I have never found any explanation of what we should do when applying (repeating) some of the business rules in a stored procedure would significantly reduce the data transferred from the database to the client application.
For example, I am currently working on presenting some statistical data over a longer period of time. Currently all business logic / rules live in the business logic layer (a DLL). A user has the option to display results at the month level for one year. If I do not apply the business rules in the stored procedure, I would need to return about 1,000,000 records and then apply the rules to those records on the client side. However, if I apply the business rules in the stored procedure, the number of returned records drops to 12.
An example of applying business rules would look something like this:
AVG(CASE WHEN Field1 IS NULL
         THEN CASE WHEN c.Field2 = 1
                   THEN (cap1.Field3 / cap1.Field4) * 60
                   ELSE CASE
                        ..... etc
So it is not simple logic but complex logic. And since this kind of logic could be repeated in many different stored procedures, it would be a candidate for a separate function in the database, to avoid repetitive code.
So, what is the recommended way here? And why?
Maybe you can still keep the business logic where it belongs and classify this stuff as "calculations"?
Either way, you have a compelling reason to do the calculations in the database layer when you are at a million-plus rows, so I would keep the calculation in functions. In your example, a reusable function would be used like:
SELECT AVG(dbo.fnFieldsEvaluate(Field1, Field2, Field3, Field4)) AS FieldAvgs,
...
Or, if it is used a lot, is simple enough, and depends only on the columns of a single row, a computed column in the table would be more convenient.
CREATE TABLE dbo.Products
(
    Field1 ....,
    Field2 ....,
    RowEvaluatesTo AS CASE WHEN Field1 IS NULL
                           THEN CASE WHEN Field2 = 1
                                     THEN (Field3 / Field4) * 60
                                     ELSE CASE ...
Your function dbo.fnFieldsEvaluate (or a computed column) would provide the one place where that calculation lives.
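For illustration, here is a minimal sketch of what such a function could look like in T-SQL, using the placeholder field names from the question; the real business-rule branches would replace the elided parts, and the final ELSE is just an assumption.

-- Hypothetical sketch: field names, types and the 60-multiplier rule are placeholders from the question.
CREATE FUNCTION dbo.fnFieldsEvaluate
(
    @Field1 decimal(18,4),
    @Field2 int,
    @Field3 decimal(18,4),
    @Field4 decimal(18,4)
)
RETURNS decimal(18,4)
AS
BEGIN
    RETURN CASE
               WHEN @Field1 IS NULL AND @Field2 = 1
                   THEN (@Field3 / NULLIF(@Field4, 0)) * 60
               -- ... the remaining business-rule branches go here ...
               ELSE @Field1   -- placeholder fallback, not from the original question
           END;
END;

One caveat: scalar UDFs can be slow when evaluated row by row over millions of rows, so an inline table-valued function (or SQL Server 2019's scalar UDF inlining) may be worth considering at this volume.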
We are implementing a new DWH solution. I have many dimensions that require slowly changing Type 2 attributes. I am considering implementing a combination of Type 2 and Type 1 attributes in my dimension: for some dimension attributes we track history by inserting new rows into the dim table (Type 2), while for other attributes we just update the existing row on any change (Type 1).
Questions:
Is this good practice? Is it OK to have a combination of SCD 1 and SCD 2 in the same dim?
Is there any limit on the number of SCD 2 attributes in a dimension? My dimension is pretty wide, around 300 columns, and one of the users is requesting that about 150 of them be tracked as SCD Type 2. Is it OK to have that many SCD 2 attributes in a dimension? Will this impact the performance of downstream reporting/BI solutions such as cubes and dashboards?
In the OLTP system we maintain an "audit" table to log any updates. Although it is not in an easily queryable format, we get answers to most of our change-related questions from it, and we don't need much reporting on data changes. Of course there are some important columns, like Status, for which we definitely need SCD 2, but for the rest I am not sure that keeping history in the DWH adds any value. My question is: given this audit table in the OLTP system, how do I decide which attributes need SCD 2 in the DWH?
Good practice? Yes. Standard feature of dimensional modelling that is overlooked too often. I've seen dimensions with combinations of SCD0, SCD1 and SCD2, and there's nothing to prevent other SCD-types being used as well.
No limit on columns, but that does seem a little excessive. You probably want to use a "hash" method to detect the SCD2 changes, where you calculate a hash over the SCD2 columns, and use this value to detect if any of the columns have changed.
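As a rough sketch of that hash approach (the table and column names here are invented; HASHBYTES is the SQL Server primitive, CONCAT_WS needs SQL Server 2017+, and other platforms have equivalents):

-- Hypothetical sketch: hash only the SCD2-tracked columns of the staged rows.
-- Note that CONCAT_WS skips NULLs, so a real implementation should map NULLs
-- to an explicit token to avoid false matches.
WITH staged AS
(
    SELECT s.CustomerId,
           HASHBYTES('SHA2_256',
                     CONCAT_WS('|', s.Status, s.Segment, s.Region)) AS Scd2Hash
    FROM   staging.Customer AS s
)
SELECT st.CustomerId
FROM   staged AS st
JOIN   dbo.DimCustomer AS d
  ON   d.CustomerId = st.CustomerId
 AND   d.IsCurrent  = 1
WHERE  d.Scd2Hash <> st.Scd2Hash;  -- only these rows need an expire-and-insert (SCD2)
-- SCD1 columns are simply overwritten on the current row; no new row is needed.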
Sorry, but I don't understand the question about audit logs. Are these logs your data source?
I'm investigating data warehouses, and I have a question about star schemas.
It's in the Oracle® OLAP Application Developer's Guide, 10g Release 1 (10.1), section 3.2.1 "Dimension Table: TIME_DIM": https://docs.oracle.com/cd/B13789_01/olap.101/b10333/global.htm#CHDCGABE
To represent the hierarchy MONTH -> QUARTER -> YEAR, we need keys such as YEAR_ID and QUARTER_ID. But there are some things that I do not understand:
1) Why do we need the fields YEAR_DSC and QUARTER_DSC? I think we could look these values up from YEAR and QUARTER tables, and keeping them here breaks 2NF.
2) What normal form does a data warehouse schema need to satisfy? (1NF, 2NF, 3NF, or any?)
NFs (normal forms) don't matter for data warehouse base tables.
We normalize to reduce certain kinds of redundancy: so that when we update a database we don't have to say the same thing in multiple places, and so that we can't accidentally fail to say the same thing everywhere it needs to be said. That is not a problem in query results, because we don't update them. The same is true of a data warehouse's base tables, which are themselves just queries over the original database's base tables.
Data warehouses are usually optimized for reading speed, and that usually means some denormalization compared to the original database, trading space to avoid recomputation. (Note, though, that sometimes re-reading something big can be slower than reading smaller parts and recomputing the big thing.) We probably don't want to drop the normalized tables when moving to a data warehouse, because they answer simple queries and we don't want to slow those down by recomputing them. Beyond those tradeoffs there is no reason not to denormalize, and particular warehouse design methods may have their own rules about which parts should be denormalized and by how much.
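To make that concrete for the TIME_DIM example from the question: a denormalized (star) time dimension simply repeats the quarter and year descriptions on every month row, so reports need no extra joins, whereas a normalized design would keep separate YEAR and QUARTER lookup tables and join them at query time. A sketch, with column names loosely following the question:

-- Denormalized month-grain time dimension: QUARTER_DSC and YEAR_DSC are repeated
-- on every month row instead of being looked up from separate tables.
CREATE TABLE TIME_DIM
(
    MONTH_ID     INTEGER PRIMARY KEY,
    MONTH_DSC    VARCHAR(30),
    QUARTER_ID   INTEGER,
    QUARTER_DSC  VARCHAR(30),
    YEAR_ID      INTEGER,
    YEAR_DSC     VARCHAR(30)
);

That repetition is exactly the 2NF violation the question notices; in a warehouse it is accepted deliberately, because the table is only ever rewritten by the load process rather than updated piecemeal.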
(Whatever NF we choose for our original database design, we should always first normalize to 5NF and then consciously denormalize. We don't need to normalize, or even know the constraints, to update or query a database.)
Read some textbook basics on why we normalize and why we use data warehouses.
In my star schema, I have a project dimension which has columns such as start_date, finish_date, service_date, onhold_date, resume_date etc.
Should I introduce foreign keys for all the dates in the fact table and connect them to a date dimension, or should I snowflake the project dimension with the date dimension? Not all the dates are available for a given project, so keeping all these columns in the fact table may result in null keys in the fact table.
What is the best way to handle dates in this scenario?
In a data warehouse I always prefer a general star schema, snowflaked as little as possible, although this is obviously a bit of personal preference and can depend on which environment you are using. Oracle (the environment I am most used to) supports snowflaking physically, but best practice is not to snowflake the business model (logical) layer.
Personally, I would push for putting the FKs on the fact, for a few reasons. One, it maintains a star, which generally performs better: snowflakes introduce more joins, and stars handle aggregation more quickly. Two, if users combine this data with data from other facts, a conformed date dimension just makes sense, can help query performance, and is more robust. Finally, stars are probably the most common design, so it should be easier for others to work on this area in the future, and the data may work better with other applications.
For null FKs, I would default to whatever default date your system has; for us, the unspecified record is 01/01/1901. I would not leave them null, unless business users must not see 1901, and even then I would probably null them out with a CASE statement, but still leave the field populated in the table.
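A rough sketch of that pattern, with made-up table and column names:

-- Hypothetical example: an 'Unspecified' row in the date dimension absorbs missing
-- dates, so the fact table's date keys are never NULL.
INSERT INTO dbo.DimDate (DateKey, FullDate, DateDescription)
VALUES (19010101, '1901-01-01', 'Unspecified');

-- Fact load: fall back to the unspecified member when a project date is missing.
INSERT INTO dbo.FactProject (ProjectKey, StartDateKey, FinishDateKey, OnHoldDateKey)
SELECT p.ProjectKey,
       COALESCE(sd.DateKey, 19010101),
       COALESCE(fd.DateKey, 19010101),
       COALESCE(od.DateKey, 19010101)
FROM   staging.Project AS p
LEFT JOIN dbo.DimDate AS sd ON sd.FullDate = p.StartDate
LEFT JOIN dbo.DimDate AS fd ON fd.FullDate = p.FinishDate
LEFT JOIN dbo.DimDate AS od ON od.FullDate = p.OnHoldDate;

-- Reports that must not show 1901 can hide it at presentation time:
-- CASE WHEN f.StartDateKey = 19010101 THEN NULL ELSE d.FullDate END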
Here is a good article describing the advantages/disadvantages of each type. Like I said, neither is completely right or wrong.
http://www.dataonfocus.com/star-schema-and-snowflake-schema/
I'm trying to create a data mart for a healthcare application. The facts in the data mart are basically measurements and findings related to the heart, and we have hundreds of them, starting from 1,000 and going as high as 20,000 per exam type.
I'm wondering what my design choices for the fact tables are:
Grain: 1 row per patient per exam type.
Some of the choices that I can think of -
1) A big wide fact table with 1000 or more columns.
2) EAV-based design: a separate Measure dimension table whose foreign key goes into the fact table, with the measure value stored in the fact table. The grain of the fact table then changes to 1 row per patient per exam type per measurement.
3) Create multiple smaller fact tables per exam type, split by some other criterion such as subgroup. But the end user is going to query across subgroups for an exam type, and fact-to-fact joins are not recommended.
4) Any other ideas?
Any input would be appreciated.
1. A big wide fact table with 1000 or more columns.
One very wide fact table gives the end user maximum flexibility if queries are executed directly in the data warehouse. However, some considerations should be taken into account, as you might hit limits depending on the platform.
SQL Server 2014 limits are as follows:
Bytes per row: 8,060. Row-overflow storage might be a solution; however, it only supports a few column types (varchar, nvarchar, varbinary, sql_variant) that are typically not used for fact measures, and it is not supported in In-Memory OLTP. https://technet.microsoft.com/en-us/library/ms186981(v=sql.105).aspx
Columns per non-wide table: 1,024. Wide tables with sparse columns are a solution, as the limit for a wide table is 30,000 columns (a sketch follows this list); however, the same bytes-per-row limit applies. https://technet.microsoft.com/en-us/library/cc280604(v=sql.120).aspx
Columns per SELECT/INSERT/UPDATE statement: 4,096
Non-clustered indexes per table: 999
https://technet.microsoft.com/en-us/library/ms143432(v=sql.120).aspx
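If the wide-table route is chosen despite these limits, sparse columns with a column set are the usual way past the 1,024-column ceiling. A hedged sketch, with invented table and measurement names:

-- Hypothetical sketch: sparse columns raise the limit to 30,000 columns per table,
-- but the 8,060 bytes-per-row limit still applies to the non-NULL values in any row.
CREATE TABLE dbo.FactExamWide
(
    PatientKey   int NOT NULL,
    ExamTypeKey  int NOT NULL,
    Measure0001  decimal(9,2) SPARSE NULL,
    Measure0002  decimal(9,2) SPARSE NULL,
    -- ... further measurement columns ...
    AllMeasures  xml COLUMN_SET FOR ALL_SPARSE_COLUMNS
);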
2. EAV-based design: a separate Measure dimension table whose foreign key goes into the fact table, with the measure value stored in the fact table. The grain of the fact table then changes to 1 row per patient per exam type per measurement.
According to Kimball, an EAV design is called fact normalization. It may make sense when the list of measurements is extremely long but sparsely populated for a given fact, and when no computations are made between facts.
Because the facts are normalized:
Extensibility is very easy, i.e. new measurements can be added without amending the data structure.
It's easy to extract all measurements for one exam and present them as rows on the screen.
It's hard to extract, aggregate or compute across several measurements (e.g. an average HDL-to-CHOL ratio) and present measurements/aggregates/computations as columns; it requires complex WHERE/PIVOT logic or multiple joins, because SQL makes it awkward to compute between facts stored in different rows (see the sketch after this list).
If the primary end-user platform is an OLAP cube, then fact normalization makes sense; cubes allow computations across any dimension.
Data import could be an issue if the source data arrives in a flat CSV format.
This question is also discussed here: Should I use EAV model?
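As a minimal sketch of this normalized design (table and column names invented), including the kind of cross-measurement computation that becomes awkward:

-- Hypothetical sketch: one fact row per patient per exam type per measurement.
CREATE TABLE dbo.DimMeasure
(
    MeasureKey    int IDENTITY PRIMARY KEY,
    MeasureName   nvarchar(100) NOT NULL,   -- e.g. 'HDL', 'CHOL'
    UnitOfMeasure nvarchar(20)  NULL
);

CREATE TABLE dbo.FactExamMeasurement
(
    PatientKey   int NOT NULL,
    ExamTypeKey  int NOT NULL,
    MeasureKey   int NOT NULL REFERENCES dbo.DimMeasure (MeasureKey),
    MeasureValue decimal(18,4) NULL
);

-- Computations between measurements require pivoting rows into columns,
-- e.g. an HDL-to-CHOL ratio per patient:
SELECT f.PatientKey,
       MAX(CASE WHEN m.MeasureName = 'HDL'  THEN f.MeasureValue END)
     / NULLIF(MAX(CASE WHEN m.MeasureName = 'CHOL' THEN f.MeasureValue END), 0) AS HdlCholRatio
FROM   dbo.FactExamMeasurement AS f
JOIN   dbo.DimMeasure AS m ON m.MeasureKey = f.MeasureKey
GROUP  BY f.PatientKey;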
3) Create multiple smaller fact tables per exam type, split by some other criterion such as subgroup. But the end user is going to query across subgroups for an exam type, and fact-to-fact joins are not recommended.
In some scenarios multiple smaller fact tables make perfect sense. One reason is hitting physical limits set by the platform, e.g. bytes per row.
The facts could be grouped either by subject area, e.g. measurement group/subgroup, or by frequency of usage. Each table could be placed on a separate filegroup and drive to maximize I/O.
Further, you could duplicate measurements across different fact tables to reduce the need for fact table joins, i.e. put one measurement both in a specific measurement-subgroup fact table and in a frequently-used-measurements fact table.
However, some considerations apply if there are specific requirements for data loading. For example, if a record errors out in the ETL for one fact table, you might want to make sure the corresponding records in the other fact tables are deleted and staged to your error table, so you don't end up with bogus information. This is especially true if end users have their own calculations in the front-end tool.
If you use OLAP cubes, then each fact table simply becomes the source of its own measure group.
In terms of fact-to-fact joins, you (or the BI application) should never issue SQL that joins two fact tables together across the fact tables' foreign keys. Instead, use the technique of drilling across two fact tables, where answer sets from two or more fact tables are created separately and the results are sort-merged on the common row header attribute values to produce the correct result.
More on this topic: http://www.kimballgroup.com/2003/04/the-soul-of-the-data-warehouse-part-two-drilling-across/
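A hedged sketch of drilling across (the fact table names are invented): each fact is aggregated separately to the same conformed grain, and only the small answer sets are merged on the common row header, so the facts are never joined row to row.

-- Hypothetical example: aggregate each fact independently to PatientKey,
-- then merge the answer sets on that conformed attribute.
WITH echo AS
(
    SELECT PatientKey, AVG(MeasureValue) AS AvgEchoValue
    FROM   dbo.FactEchoMeasurement
    GROUP  BY PatientKey
),
lab AS
(
    SELECT PatientKey, AVG(MeasureValue) AS AvgLabValue
    FROM   dbo.FactLabMeasurement
    GROUP  BY PatientKey
)
SELECT COALESCE(e.PatientKey, l.PatientKey) AS PatientKey,
       e.AvgEchoValue,
       l.AvgLabValue
FROM   echo AS e
FULL OUTER JOIN lab AS l ON l.PatientKey = e.PatientKey;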
4) Any other ideas?
XML columns or some kind of NoSQL store could be an option, but the same querying / aggregation / computation / presentation issues exist.
I have a table that currently has 40 fields. A significant expansion of its capability now has it looking more like 100 fields.
What are the database and Rails performance implications of having a table with more fields? My understanding of relations is that they don't load the data until absolutely necessary, but would having so much more information slow down, say, a filtered index of these records (showing only the main 8-10 fields)?
The fields I'm specifically talking about adding are not relevant to any of my reports or most of my queries - they simply store data that is used on the back end.
Normalization is not a problem here (there are no fields like field1, field2, ..., for example). I know it's hard to answer these questions when posed in a qualitative manner, but is it likely better to add these 60 fields to this table, or should I create a separate 1-1 table for them?
Having a single table is not a big deal and makes things easier when it comes to queries. So if the data is relevant, there's no need to split.
Still, you should only query what you need in your views, so use ActiveRecord's select: doc here.
Yes, having a lot of fields will slow down access to the table; however, in general not significantly enough to matter for average data sizes. Most SQL databases arrange tables row by row, so on disk all 40 fields of row 1 are stored first, then all 40 fields of row 2, and so on. This means that if you are only interested in retrieving the first 2 fields, you will still read the other 38 fields before jumping to the next matching row. This is not a big issue if you have only a few matching rows, but it can be if you have many matches that are also consecutive.
That said, I would still strongly advise against a table with 40 fields unless there is a very good reason for it (which you might have, but you give too little detail to judge). In general, having that many fields suggests that some alternative design is called for. Certainly, if what I wrote above starts becoming an issue, you should group the fields according to the access patterns (so if fields 1-10 and fields 20, 24, 25 and 30 are normally accessed together, put those groups into separate tables), as in the sketch below.
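If splitting does become worthwhile, the usual shape is a 1:1 extension table keyed by the parent's primary key, roughly like this (names and types are invented; the rarely-read back-end fields move out so the hot table stays narrow):

-- Hypothetical sketch: frequently listed/filtered fields stay in the main table,
-- back-end-only fields live in a 1:1 extension table sharing the primary key.
CREATE TABLE products
(
    id    bigint PRIMARY KEY,
    name  varchar(255) NOT NULL,
    price decimal(10,2) NOT NULL
    -- ... the 8-10 fields shown in listings ...
);

CREATE TABLE product_details
(
    product_id     bigint PRIMARY KEY REFERENCES products (id),
    import_payload varchar(4000) NULL,
    sync_metadata  varchar(4000) NULL
    -- ... the remaining back-end-only fields ...
);

In Rails this maps naturally onto a has_one association, so the extension row is only loaded when those fields are actually needed.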