Naming standards for dimensional modeling

I am working on my first dimensional modeling assignment for a Data Warehouse project using Kimball's approach. As I prepare my model and think about physical objects, I wonder what is the recommended naming scheme for database objects. We're going to use Oracle, and we don't really have any standards at present. Any help would be appreciated.

You can take some ideas from the Oracle BI Applications Data Model.
Log in to your Oracle support account and look for this document:
Oracle Business Analytics Warehouse Data Model Reference Version 7.9.6.3 (Doc ID 1325948.1)
These are some of the naming conventions included:
PREFIX
W_ = Warehouse
SUFFIX
_A = Aggregate
_D = Dimension
_DH = Dimension Hierarchy
_DHS = Staging for Dimension Hierarchy
_DS = Staging for Dimension
_F = Fact
_FS = Staging for Fact
_H = Helper
_MD = Mini Dimension
_TMP = Pre-staging temporary table
For example, a Sales fact table would be W_SALES_F.
This document from Northwestern University has useful tips for naming columns, such as using prime, qualifier and class words (e.g. STUDENT_FIRST_NAME).
The Kimball Group's Design Tip #71 contains general guidelines for naming conventions. For example, a sales analyst would be interested in Sales numbers, but it turns out that this Sales number is really Sales_Commissionable_Amount, which is different from Sales_Gross_Amount and Sales_Net_Amount.
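To make that concrete, here is a minimal sketch of a dimension and a fact table following these conventions; the table and column names are hypothetical, not taken from the BI Applications reference:
-- Hypothetical Oracle DDL illustrating the W_ prefix, the _D/_F suffixes,
-- and prime + qualifier + class column words.
CREATE TABLE W_CUSTOMER_D (
  CUSTOMER_WID         NUMBER PRIMARY KEY,      -- surrogate (warehouse) key
  CUSTOMER_FIRST_NAME  VARCHAR2(100),           -- prime word + qualifier + class word
  CUSTOMER_LAST_NAME   VARCHAR2(100)
);

CREATE TABLE W_SALES_F (
  SALES_WID                    NUMBER PRIMARY KEY,
  CUSTOMER_WID                 NUMBER REFERENCES W_CUSTOMER_D (CUSTOMER_WID),
  SALES_COMMISSIONABLE_AMOUNT  NUMBER(15,2),    -- self-describing measure names
  SALES_GROSS_AMOUNT           NUMBER(15,2),
  SALES_NET_AMOUNT             NUMBER(15,2)
);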

Related

Datawarehouse design

I am going to design a data warehouse (although it's not an easy process). I am wondering how, throughout the ETL process, the data in the data warehouse is extracted/transformed into the data marts.
Is there any difference in model design between a data warehouse and a data mart? Is it usually a star schema or a snowflake? Should we place the tables like the following?
In Datawarehouse
dim_tableA
dim_tableB
fact_tableA
fact_tableB
And in Datamart A
dim_tableA (full copy from datawarehouse)
fact_tableA (full copy from datawarehouse)
And in Datamart B
dim_tableB (full copy from datawarehouse)
fact_tableB (full copy from datawarehouse)
Is there a real-life example that can demonstrate the model difference between a data warehouse and a data mart?
I echo both of Nick's responses, and in a more technical way, following the Kimball methodology:
In my opinion and experience, at a high level we have data marts like Service Analytics, Financial Analytics, Sales Analytics, Marketing Analytics, Customer Analytics, etc. These are grouped as below:
Subject Areas -> Logical grouping (star modelling) -> Data marts -> Dimensions & facts (as per Kimball)
Example:
AP Real Time -> Supplier, Supplier Transactions, GL Data -> Financial Analytics + Customer Analytics -> Physical tables
Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization, for example, the sales department. ... A data warehouse is a large centralized repository of data that contains information from many sources within an organization.
Depending on their needs, companies can use multiple data marts for different departments and opt for data mart consolidation by merging different marts to build a single data warehouse later. This approach is called the Kimball Dimensional Design Method. Another method, called The Inmon Approach, is to first design a data warehouse and then create multiple data marts for particular services as needed.
An example: in a data warehouse, email clicks are recorded based on a click date, with the email address being just one of the click parameters. For a CRM expert, the email address (or any other customer identifier) will be the entry point: against each contact, the frequency of clicks, the date of the last click, and so on.
The data mart is a prism that adapts the data to the user. Its success therefore depends largely on the way the data is organized: the more understandable it is to the user, the better the result. This is why the names of the fields and the way they are calculated must stick as closely as possible to business usage.
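A minimal sketch of that idea, with hypothetical table and column names: the warehouse keeps click facts at the grain of the click event, while a CRM-oriented data mart exposes the same data re-keyed by email address, for example as a view:
-- Warehouse: one row per click event, keyed by click date.
CREATE TABLE dw_fact_email_click (
  click_date    DATE,
  email_address VARCHAR(320),
  campaign_id   INT
);

-- CRM data mart: the same facts re-keyed, one row per contact.
CREATE VIEW dm_crm_contact_clicks AS
SELECT email_address,
       COUNT(*)        AS click_count,
       MAX(click_date) AS last_click_date
FROM dw_fact_email_click
GROUP BY email_address;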

Does a data warehouse need to satisfy 2NF or another normal form?

I'm investigating data warehouses. And I have an issue about star schemas.
It's in
Oracle® OLAP Application Developer's Guide
10g Release 1 (10.1)
3.2.1 Dimension Table: TIME_DIM
https://docs.oracle.com/cd/B13789_01/olap.101/b10333/global.htm#CHDCGABE
To represent the hierarchy MONTH -> QUARTER -> YEAR, we need some keys such as: YEAR_ID, QUARTER_ID. But there are some things that I do not understand:
1) Why do we need the fields YEAR_DSC and QUARTER_DSC? I think we could look these values up from a YEAR and a QUARTER table, and keeping them here breaks 2NF.
2) What is the normal form that a schema in data warehouse needs to satisfy? (1NF, 2NF, 3NF, or any.)
NFs (normal forms) don't matter for data warehouse base tables.
We normalize to reduce certain kinds of redundancy, so that when we update a database we don't have to say the same thing in multiple places, and so that we can't accidentally fail to say the same thing everywhere it needs to be said. That is not a problem in query results, because we are not updating them. The same is true for a data warehouse's base tables (which are themselves just queries over the original database's base tables).
Data warehouses are usually optimized for reading speed, and that usually means some denormalization compared to the original database, to avoid recomputation at the expense of space. (Notice, though, that sometimes rereading something bigger can be slower than reading smaller parts and recomputing the big thing.) We probably don't want to drop the normalized tables when moving to a data warehouse, because they answer simple queries and we don't want to slow those down by recomputing them. Other than those tradeoffs, there's no reason not to denormalize. Some particular warehouse design methods have their own rules about which parts should be denormalized, and by how much.
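As a concrete illustration of that tradeoff, here is a minimal sketch of a denormalized time dimension along the lines of the TIME_DIM in the question (column names are illustrative, not copied from the Oracle guide):
-- Denormalized time dimension at month grain.
-- QUARTER_DSC and YEAR_DSC depend on QUARTER_ID and YEAR_ID rather than on the key,
-- so the table is not fully normalized, but reporting queries never have to join
-- out to separate QUARTER and YEAR tables just to get the labels.
CREATE TABLE TIME_DIM (
  MONTH_ID    INT PRIMARY KEY,
  MONTH_DSC   VARCHAR(30),
  QUARTER_ID  INT,
  QUARTER_DSC VARCHAR(30),
  YEAR_ID     INT,
  YEAR_DSC    VARCHAR(30)
);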
(Whatever our original database design NF is chosen to be, we should always first normalize to 5NF then consciously denormalize. We don't need to normalize or know constraints to update or query a database.)
Read some textbook basics on why we normalize & why we use data warehouses.

self-join on documentdb syntax error

I'm having trouble doing an otherwise valid SQL self-join query on DocumentDB.
So the following query works:
SELECT * FROM c AS c1 WHERE c1.obj="car"
But this simple self-join query does not:
SELECT c1.url FROM c AS c1 JOIN c AS c2 WHERE c1.obj="car" AND c2.obj="person" AND c1.url = c2.url
It fails with the error: Identifier 'c' could not be resolved.
It seems that DocumentDB supports self-joins within a document, but I'm asking at the collection level.
I looked at the official syntax doc and understand that the collection name is basically inferred; I tried changing c to explicitly my collection name and root but neither worked.
Am I missing something obvious? Thanks!
A few things to clarify:
1.) Regarding Identifier 'c' could not be resolved
Queries are scoped to a single collection; and in the example above, c is an implicit alias for the collection which is being re-aliased to c1 with the AS keyword.
You can fix the example query by changing the JOIN to reference c1:
SELECT c1.url
FROM c AS c1
JOIN c1 AS c2
WHERE c1.obj="car" AND c2.obj="person" AND c1.url = c2.url
This is also equivalent to:
SELECT c1.url
FROM c1
JOIN c1 AS c2
WHERE c1.obj="car" AND c2.obj="person" AND c1.url = c2.url
2.) Understanding JOINs and examining your data model
With that said, I don't think fixing the query syntax issue above will produce the behavior you are expecting. The JOIN keyword in DocumentDB SQL is designed for forming a cross product with a denormalized array of elements within a document (as opposed to forming cross products across other documents in the same collection). If you run into struggles here, it may be worth taking a step back and revisiting how to model your data for Azure Cosmos DB.
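For example, the supported use of JOIN iterates over an array inside a single document; a small sketch, assuming a hypothetical document shape with a children array:
-- Intra-document JOIN: 'child' iterates over the children array of each document c,
-- producing one result per (document, array element) pair.
SELECT c.id, child.name
FROM c
JOIN child IN c.children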
In an RDBMS, you are trained to think entity-first and normalize your data model based on entities. You rely heavily on a query engine to optimize queries to fit your workload (which typically does a good, but not always optimal, job of retrieving data). The challenges here are that many relational benefits are lost as scale increases, and scaling out to multiple shards/partitions becomes a requirement.
For a scale-out distributed database like Cosmos DB, you will want to start with understanding the workload first and optimize your data model to fit the workload (as opposed to thinking entity first). You'll want to keep in mind that collections are merely a logical abstraction composed of many replicas that live within partition sets. They do not enforce schema and are the boundary for queries.
When designing your model, you will want to incorporate the following questions in to your thought process:
What is the scale, in terms of size and throughput, for the broader solution (an estimate of order of magnitude is sufficient)?
What is the ratio of reads vs writes?
For writes - what is the pattern for writes? Is it mostly inserts, or are there a lot of updates?
For reads - what do top N queries look like?
The above should influence your choice of partition key as well as what your data / object model should look like. For example:
The ratio of requests will help guide how you make tradeoffs (use the Pareto principle and optimize for the bulk of your workload).
For read-heavy workloads, commonly filtered properties become candidates for choice of partition key.
Properties that tend to be updated together frequently should be abstracted together in the data model, and away from properties that get updated with a slower cadence (to lower the RU charge for updates).
Don't be afraid to duplicate properties across different record types to enrich queryability, and to annotate each record with its type. For example, we have two types of documents: cat and person.
{
   "id": "Andrew",
   "type": "Person",
   "familyId": "Liu",
   "employer": "Microsoft"
}
 
{
   "id": "Ralph",
   "type": "Cat",
   "familyId": "Liu",
   "fur": {
         "length": "short",
         "color": "brown"
   }
}
 
We can query both types of documents without needing a JOIN simply by running a query without a filter on type:
SELECT * FROM c WHERE c.familyId = "Liu"
And if we wanted to filter on type = “Person”, we can simply add a filter on type to our query:
SELECT * FROM c WHERE c.familyId = "Liu" AND c.type = "Person"
The answer above has the queries mentioned by Andrew Liu. This will resolve your error, but note that Azure Cosmos DB does not support cross-item and cross-container joins. Use this link to read about joins: https://learn.microsoft.com/en-us/azure/cosmos-db/sql/sql-query-join

Data Warehousing - Star Schema vs Flat Table

I'm trying to design a Data Warehouse for a single store of commonly required data ranging from finance systems, project scheduling systems and a myriad of scientific systems. I.e. many different data marts.
I have been reading up on Data Warehousing and popular methods such as Star Schemas and Kimball methods etc but one question I cannot find answer to is:
Why is it better to design your DW Data Mart as a star schema rather than a single flat table?
Surely having no joins between facts and attributes/dimensions is faster and simpler than having lots of small joins to all the dimension tables? Disk space is not a problem, we'll just throw more disks at the database if necessary. Is the star schema slightly outdated these days or is it still data architect dogma?
Your question is very good: the Kimball mantra for dimensional modelling is to improve performance and to improve usability.
But I don't think it is outdated, or dogma; it is a reasonable, practical approach for many situations and platforms.
The way relational DBs store data means there's a balancing act to be struck between the numbers and types of tables, the routes in to the data for typical queries, easy maintainability and description of relationships between data, the numbers of joins, the way the joins are constructed, the indexability of columns, etc.
3NF (or further) is one end of the spectrum, suiting OLTP systems, and a single table is the other end of the spectrum. Dimensional models are in the middle and appropriate for reporting, at least when using certain technologies.
Performance isn't all about 'number of joins', although a star schema performs better for reporting workloads than a fully normalised database, in part because of a reduced number of joins. Dimensions are typically very wide, so if you include all those dimension fields in every row of every fact, you have very large rows indeed, and finding your way into those rows will perform very badly for typical queries.
Facts are numerous, so if you can make those tables compact, with the 'wordier' dimensions filterable, you hit a sweet spot of performance that a single table isn't going to match, unless heavily indexed.
And yes, a single table per fact is simpler in terms of the number of tables, but is it really easier to navigate? Dimensions and facts are easy concepts to understand, and what if you want to cross your queries across facts? You've got many different data marts, but one of the benefits of having a data warehouse in the first place is that these aren't distinct: they're related and can be reported across. Conformed dimensions enable this.
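A rough sketch of that drill-across idea, with hypothetical table names: both fact tables share the same conformed date dimension, so their measures can be aggregated to a common grain and lined up side by side:
-- Drill-across via a conformed dimension: aggregate each fact to the shared grain (month),
-- then join the two result sets on the conformed dimension key.
SELECT s.month_key, s.sales_amount, f.forecast_amount
FROM (SELECT d.month_key, SUM(fs.sales_amount) AS sales_amount
      FROM fact_sales fs JOIN dim_date d ON fs.date_key = d.date_key
      GROUP BY d.month_key) s
JOIN (SELECT d.month_key, SUM(ff.forecast_amount) AS forecast_amount
      FROM fact_forecast ff JOIN dim_date d ON ff.date_key = d.date_key
      GROUP BY d.month_key) f
  ON s.month_key = f.month_key;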
If you combine your fact and dimensions into a single table, you'll either lose the visibility into dimension attributes that have never been used, or your measures will be thrown off by inclusion of a dummy event for the unused dimension attribute.
For example, a restaurant menu is a dimension and the purchased food is a fact. If you combined these into one table, how would you identify which food has never been ordered? For that matter, prior to your first order, how would you identify what food was available on the menu?
The dimension represents the possibilities, the fact represents the realization of the possibilities.
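A small sketch of that point, with hypothetical table names: with a separate dimension, the items that have never been ordered fall out of a simple outer join, something a single flattened table cannot answer because the menu only exists where an order row happens to mention it:
-- Menu items that have never been ordered: only answerable because the dimension
-- row exists independently of any fact row.
SELECT m.menu_item_name
FROM dim_menu_item m
LEFT JOIN fact_food_order f ON f.menu_item_key = m.menu_item_key
WHERE f.menu_item_key IS NULL;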
Combining facts and dimensions in the same table limits scalability and flexibility.
Suppose that one day the business decides to change a dimension description (for example, the product name). Dimension tables aren't as deep as fact tables, so the update process or SCD management should be easier and less resource intensive.
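For example, a Type 1 (overwrite) change to a product name touches one dimension row, whereas in a flattened table every historical row carrying the old name has to be rewritten (table and column names are hypothetical):
-- Type 1 (overwrite) change: one row in the dimension...
UPDATE dim_product
SET product_name = 'New Product Name'
WHERE product_key = 42;

-- ...versus rewriting potentially millions of rows in a flattened fact table:
UPDATE flat_sales
SET product_name = 'New Product Name'
WHERE product_name = 'Old Product Name';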

How to model a relational database into a neo4j graph database?

I have a relational database (about 30 tables) and I would like to transpose it into a Neo4j graph database, but I don't know where to start...
Is there a general way to transpose tables and/or tuples into a graph model (relationship properties, one or more graphs)? What are the best sources of documentation?
Thanks for any help,
Best regards
First, if at all possible, I'd suggest NOT using your relational DB as your "reference" for transposing to a graph model. All too often, mistakes and pitfalls from relational modelling get transferred over to the graph model and introduce other oddities. In fact, if you have a source ER diagram, that might be an even better starting point as it's really already a graph. And maybe even consider a re-modelling exercise for your domain!
That said, from a basic point of view, you can think of most tables as representing a node type (e.g. "User" or "Movie") with join tables and keys representing relationship types.
A great starting point, from my perspective anyway, is to determine some questions your graph/data source should answer. Write those questions down, and try to come up with Cypher queries that represent the questions. Often times, a graph model naturally arises from such an effort, and it's really not that difficult.
If you haven't already, I'd strongly recommend picking up a (free) copy of the Graph Databases ebook from here: http://graphdatabases.com/
It's jam-packed with a lot of good info on where to start with modelling your domain and even things to consider when you're used to doing things in a relational manner. It also contains some material on Cypher, although the Neo4j site (neo4j.org) has a reference manual with plenty of up-to-date info on Cypher.
Hope this helps!
There's not going to be a one-stop-shop for this kind of conversion, as not all data models are appropriate for graph modeling, and every application is a unique special snowflake...but with that said.....
Generally, your 'base' tables (e.g. User, Role, Order, Product) would become nodes, and your 'join tables' (a.k.a. buster tables) would be candidates for your relationships (e.g. UserRole, OrderLineItem). The key thing to remember is that in a graph, generally, you can only have one relationship of a given type between two specific nodes, so in the above example, if your system allows the same product to be in an order twice, it would cause issues.
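As a small illustration of that mapping (hypothetical schema), a many-to-many join table on the relational side is the natural candidate for a relationship type on the graph side, with its extra columns becoming relationship properties:
-- Relational side: the base tables are node candidates; the join table maps to a relationship.
CREATE TABLE users (user_id INT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE roles (role_id INT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE user_roles (
  user_id    INT REFERENCES users (user_id),
  role_id    INT REFERENCES roles (role_id),
  granted_on DATE,                        -- extra columns become relationship properties
  PRIMARY KEY (user_id, role_id)          -- one row per pair, i.e. one relationship between two nodes
);
-- Graph side, informally: (:User {name}) -[:HAS_ROLE {granted_on}]-> (:Role {name})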
Foreign keys are your second source of relationships, look to them to see if it makes sense to be a relationship or just a property.
Just keep in mind what you are trying to solve by your data model - if it's traversing your objects to find relationships and distance, etc... then graphs may be a good fit. If you are modeling an eCommerce app, where you are dealing with manipulating a single nested object (e.g. order -> line item -> product -> sku), then a relational model may be the right fit.
Hope my $0.02 helps...
As has been already said, there is no magical transformation from a relational database model to a graph database model.
You should look for the original entities and how they are related in order to find your nodes, properties and relations. And always keeping in mind what type of queries you are going to perform.
As BtySgtMajor said, "Graph Databases" is a good book to start, and it is free.
