what is Dim, what is Fact? - data-warehouse

I have an application that I know would make a great cube and would be useful for more than the standard flat Reporting Services report. We're about to jump into BI stuff with a consultant, but I'd like to give it a shot before we do, mostly so I know something of what we're going to do.
The application tracks surveys in nursing homes across the country. They can be annual, complaint, or several other types of survey; they have penalties associated with the tags given, and they have documentation associated with them.
What I'd like to do is come up with a way that will allow us to leverage the data we have - how many tags in Florida for the month of June? How many facilities were on time delivering their documentation? How many annual (surprise) surveys happened in the 1st quarter of this year compared to last year?
I'm including the schemas in hopes that someone will be able to tell me not only what is dim and what is fact, but what data goes where. I figure that'll be a great start.
Anything would be really helpful. I'm trying to get a small data mart set up while I'm poring over the Data Warehouse Lifecycle Toolkit by Kimball.
Thanks!
The Entity table - a list of all of our facilities: Primary key is a five letter code denoting the building
CREATE TABLE [dbo].[Entity](
[entID] [varchar](10) NOT NULL,
[entShortName] [varchar](150) NULL,
[entNumericID] [int] NOT NULL,
[orgID] [int] NOT NULL,
[regionID] [int] NOT NULL,
[portID] [int] NOT NULL,
[busTypeID] [int] NOT NULL,
[adpID] [varchar](50) NULL,
[eHealthDataID] [varchar](50) NULL,
[updateDate] [datetime] NULL CONSTRAINT [DF_Entity_updateDate] DEFAULT (getdate()),
[powProID] [int] NULL,
[regionReportingID] [int] NULL,
[regionPresEmail] [varchar](300) NULL,
[regionClinDirEmail] [varchar](300) NULL,
CONSTRAINT [PK_EntityNEW] PRIMARY KEY CLUSTERED
(
[entID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 75) ON [PRIMARY]
) ON [PRIMARY]
Survey Main
CREATE TABLE [dbo].[surveyMain](
[surveyID] [int] IDENTITY(1,1) NOT NULL,
[surveyDateFac] AS (([facility]+'-')+CONVERT([varchar],[surveyDate],(101))),
[surveyDate] [datetime] NOT NULL,
[surveyType] [int] NOT NULL,
[surveyBy] [int] NULL,
[facility] [varchar](10) NOT NULL,
[originalSurvey] [int] NULL,
[exitDate] [datetime] NULL,
[dpnaDate] AS (dateadd(month,(3),[exitDate])),
[clearedTags] [varchar](1) NULL,
[substantiated] [varchar](1) NULL,
[firstRevisit] [int] NULL,
[secondRevisit] [int] NULL,
[thirdRevisit] [int] NULL,
[fourthRevisit] [int] NULL,
[updated] [datetime] NULL CONSTRAINT [DF_surveyMain_updated] DEFAULT (getdate()),
CONSTRAINT [PK_tagSurvey] PRIMARY KEY CLUSTERED
(
[surveyID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]
Survey Types:
CREATE TABLE [dbo].[surveyTypes](
[surveyTypeID] [int] IDENTITY(1,1) NOT NULL,
[surveyTypeDesc] [varchar](100) NOT NULL,
CONSTRAINT [PK_surveyTypes] PRIMARY KEY CLUSTERED
(
[surveyTypeID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Survey Files
CREATE TABLE [dbo].[surveyFiles](
[surveyFileID] [int] IDENTITY(1,1) NOT NULL,
[surveyID] [int] NOT NULL,
[surveyFilesTypeID] [int] NOT NULL,
[documentDate] [datetime] NOT NULL,
[responseDate] [datetime] NULL,
[receiptDate] [datetime] NULL,
[dateCertain] [datetime] NULL,
[fileName] [varchar](250) NULL,
[fileUpload] [image] NULL,
[fileDesc] [varchar](100) NULL,
[updated] [datetime] NOT NULL CONSTRAINT [DF_surveyFiles_updated] DEFAULT (getdate()),
CONSTRAINT [PK_surveyFiles] PRIMARY KEY CLUSTERED
(
[surveyFileID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 75) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
Survey Fines
CREATE TABLE [dbo].[surveyFines](
[surveyFinesID] [int] IDENTITY(1,1) NOT NULL,
[surveyID] [int] NULL,
[surveyFinesTypeID] [int] NULL,
[dateRecommended] [datetime] NULL,
[dateImposed] [datetime] NULL,
[totalFineAmt] [varchar](100) NULL,
[wasImposed] [varchar](3) NULL,
[dateCleared] [datetime] NULL,
[comments] [varchar](500) NULL,
[updated] [datetime] NOT NULL CONSTRAINT [DF_surveyFines_updated] DEFAULT (getdate()),
CONSTRAINT [PK_surveyFines] PRIMARY KEY CLUSTERED
(
[surveyFinesID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 75) ON [PRIMARY]
) ON [PRIMARY]
Survey Tags
CREATE TABLE [dbo].[surveyTags](
[seq] [int] IDENTITY(1,1) NOT NULL,
[surveyID] [int] NOT NULL,
[tagDescID] [int] NOT NULL,
[tagStatus] [int] NULL,
[scopesev] [varchar](5) NOT NULL,
[comments] [varchar](1000) NULL,
[clearedDate] [datetime] NULL,
[updated] [datetime] NULL CONSTRAINT [DF_surveyTags_updated] DEFAULT (getdate()),
CONSTRAINT [PK_tagMain] PRIMARY KEY CLUSTERED
(
[seq] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]

What I'd like to do is come up with a way that will allow us to leverage the data we have - how many tags in Florida for the month of June? How many facilities were on time delivering their documentation? How many annual(surprise) surveys happened in the 1st quarter of this year compared to last year?
A dimension is a measurement range. The measurement range can be continuous, like dates, or discrete, like facilities. In your questions, the dimensions are facility and date, date/time, and date, respectively.
The only way you can answer the question "How many tags in Florida for the month of June?" is to associate tags with facilities and tags with dates.
The only way you can answer the question "How many facilities were on time delivering their documentation?" is to associate documentation delivery with facility and date due with facility.
You should follow this same analytical process with the rest of the questions or queries you expect the data warehouse to answer.
A fact is an entity or an object. A tag is a fact. Documentation delivery is a fact. Facts are almost always immutable in a data warehouse once they're loaded.
As to your schema, I'd have to study it more to give specific recommendations, but in general, you want to use a star schema. The center of the star(s) are your facts, entities, and objects. The tables that make up the points of the star are your dimension tables.
The first thing you need to do is separate your facts and your dimensions. None of your entity tables should contain dates, location codes, or whatever else you determine is a dimension. However, your fact tables will contain foreign keys to date tables, location tables, or other dimension tables.
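As a rough illustration of that separation, here is a minimal sketch of one fact table with foreign keys to a date dimension and a facility dimension. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and every table and column name in it (dimDate, dimFacility, factTag, and especially the state column, which does not appear in the posted Entity schema) is a hypothetical assumption, not something taken from your database.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimDate (
    dateKey  INTEGER PRIMARY KEY,   -- e.g. 20100615
    fullDate TEXT    NOT NULL,
    year     INTEGER NOT NULL,
    month    INTEGER NOT NULL
);
CREATE TABLE dimFacility (
    facilityKey INTEGER PRIMARY KEY,
    entID       TEXT NOT NULL,      -- business key from the source system
    state       TEXT NOT NULL       -- hypothetical attribute, not in the posted schema
);
CREATE TABLE factTag (
    tagKey      INTEGER PRIMARY KEY,
    facilityKey INTEGER NOT NULL REFERENCES dimFacility(facilityKey),
    dateKey     INTEGER NOT NULL REFERENCES dimDate(dateKey)
);
""")

# Made-up sample rows, purely for illustration.
conn.executemany("INSERT INTO dimDate VALUES (?, ?, ?, ?)", [
    (20100601, "2010-06-01", 2010, 6),
    (20100615, "2010-06-15", 2010, 6),
    (20100702, "2010-07-02", 2010, 7),
])
conn.executemany("INSERT INTO dimFacility VALUES (?, ?, ?)", [
    (1, "FLMIA", "FL"),
    (2, "GAATL", "GA"),
])
conn.executemany("INSERT INTO factTag (facilityKey, dateKey) VALUES (?, ?)", [
    (1, 20100601), (1, 20100615), (1, 20100702), (2, 20100615),
])


def count_tags(conn, state, year, month):
    """Count tag facts for one state and one calendar month."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM factTag AS f
        JOIN dimFacility AS fac ON fac.facilityKey = f.facilityKey
        JOIN dimDate     AS d   ON d.dateKey       = f.dateKey
        WHERE fac.state = ? AND d.year = ? AND d.month = ?
        """,
        (state, year, month),
    ).fetchone()
    return row[0]


florida_june_tags = count_tags(conn, "FL", 2010, 6)
print(florida_june_tags)
```

The point of the sketch is only that the fact table carries keys, while the descriptive attributes you filter on (state, year, month) live in the dimension tables.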
You'll probably also need summary tables. Summary tables contain the same columns as your fact tables, with the addition of one or more sums across different dimensions. As an example, the question "How many tags in Florida for the month of June?" can be answered much more quickly if you already have the sum of the tags for Florida (or, more properly, each facility in Florida) for the month (or each of the days) of June, 2010.
The period that you sum for depends on the mixture of queries that you expect. In your data warehouse, day might be too short a period: if only a handful of facts fall on any one day, it's just as quick to do the summary in SQL as it is to select the summary row.
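To make the summary-table idea concrete, here is a small sketch that rebuilds a monthly tag count per facility from a fact table. It again uses Python's sqlite3 in place of SQL Server, and the names factTag, sumTagMonthly, and rebuild_monthly_summary are hypothetical. A full rebuild is just the simplest thing to show; an incremental load would be a reasonable alternative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE factTag (
    tagKey      INTEGER PRIMARY KEY,
    facilityKey INTEGER NOT NULL,
    tagDate     TEXT    NOT NULL      -- ISO date string, e.g. 2010-06-15
);
CREATE TABLE sumTagMonthly (
    facilityKey INTEGER NOT NULL,
    year        INTEGER NOT NULL,
    month       INTEGER NOT NULL,
    tagCount    INTEGER NOT NULL,
    PRIMARY KEY (facilityKey, year, month)
);
""")

# Made-up sample facts.
conn.executemany("INSERT INTO factTag (facilityKey, tagDate) VALUES (?, ?)", [
    (1, "2010-06-01"), (1, "2010-06-15"), (1, "2010-07-02"), (2, "2010-06-15"),
])


def rebuild_monthly_summary(conn):
    """Recompute the whole monthly summary from the fact table."""
    conn.execute("DELETE FROM sumTagMonthly")
    conn.execute("""
        INSERT INTO sumTagMonthly (facilityKey, year, month, tagCount)
        SELECT facilityKey,
               CAST(strftime('%Y', tagDate) AS INTEGER),
               CAST(strftime('%m', tagDate) AS INTEGER),
               COUNT(*)
        FROM factTag
        GROUP BY facilityKey, strftime('%Y', tagDate), strftime('%m', tagDate)
    """)


rebuild_monthly_summary(conn)
summary_rows = conn.execute(
    "SELECT facilityKey, year, month, tagCount "
    "FROM sumTagMonthly ORDER BY facilityKey, year, month"
).fetchall()
print(summary_rows)
```

A query for "tags in June 2010" then reads one pre-summed row per facility instead of scanning every tag fact.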
You'll need a calendar table too. A calendar table makes questions like, "How many annual(surprise) surveys happened in the 1st quarter of this year compared to (the 1st quarter of) last year?" much easier to query.
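Here is one way a calendar table could be sketched, with the quarter precomputed so that a year-over-year comparison becomes a simple join and GROUP BY. This is Python with sqlite3 rather than SQL Server, and the calendar and survey tables, their columns, and the sample rows are all assumptions made up for the example.

```python
import sqlite3
from datetime import date, timedelta


def build_calendar(conn, start, end):
    """Create a calendar table with one row per day from start to end inclusive."""
    conn.execute("""
        CREATE TABLE calendar (
            calDate TEXT PRIMARY KEY,   -- ISO date string
            year    INTEGER NOT NULL,
            quarter INTEGER NOT NULL,   -- 1 to 4
            month   INTEGER NOT NULL
        )
    """)
    rows = []
    day = start
    while day <= end:
        rows.append((day.isoformat(), day.year, (day.month - 1) // 3 + 1, day.month))
        day += timedelta(days=1)
    conn.executemany("INSERT INTO calendar VALUES (?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
build_calendar(conn, date(2009, 1, 1), date(2010, 12, 31))

# Simplified, hypothetical survey table with made-up rows.
conn.execute("""
    CREATE TABLE survey (
        surveyID   INTEGER PRIMARY KEY,
        surveyDate TEXT NOT NULL,
        surveyType TEXT NOT NULL
    )
""")
conn.executemany("INSERT INTO survey (surveyDate, surveyType) VALUES (?, ?)", [
    ("2009-02-10", "annual"),
    ("2009-05-01", "annual"),
    ("2010-01-20", "annual"),
    ("2010-03-31", "annual"),
    ("2010-03-05", "complaint"),
])

# Annual surveys in the first quarter, counted per year.
q1_counts = dict(conn.execute("""
    SELECT c.year, COUNT(*)
    FROM survey AS s
    JOIN calendar AS c ON c.calDate = s.surveyDate
    WHERE s.surveyType = 'annual' AND c.quarter = 1
    GROUP BY c.year
    ORDER BY c.year
""").fetchall())
print(q1_counts)
```

Because the quarter is stored on the calendar row, the query never has to do date arithmetic, and fiscal quarters or holidays could be added as further columns in the same way.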

This is quite a task for a support forum, so I will focus on just one part of the problem.
It seems that one survey can consist of several visits, so I would suggest factSurveyVisit with a grain of one visit-event. The column SurveyID acts as a degenerate dimension in this model and is common to all visits from the same survey. The SurveyVisitSequenceID is a unique auto-increment (integer) and is used to simplify linking of the two bridge tables for documents and tags to the fact table.
You could also promote a survey into a full dimension dimSurvey to add some notes etc; use SurveyID for link.
I did not tackle fines here. For these I would suggest a factFine table, which would have its own links to dimDate, dimTime, dimFacility, etc., so that reports regarding fines ($$) can be done fast without joining to most of the visit-related tables. There should also be a bridge table joining factFine to factSurveyVisit, provided fines are related to each visit and not to a completed survey.
EDIT
I just noticed that your surveyTags table has clearedDate, so admittedly I do not understand the tagging in this business. In the model, dimTag is just a list of available tags. There may be one more factFacilityStatus table linking dimFacility and dimTag, tracking tag status for each facility.

It looks like you have multiple Fines, Files and Tags for each survey.
I would expect 4 fact tables - with the facts in each looking like they are largely datetime data (although these are often modelled as roles of a date and/or time dimension - I've made a couple notes here, but flags are generally going to be in dimensions):
SurveyMain
SurveyFine (wasImposed is in a dimension linked to this fact, totalFineAmt is a fact in this table)
SurveyFile
SurveyTag
They would all share a Survey dimension, and I would go ahead and share an Entity/Facility dimension in each one. You could snowflake through the Survey dimension, but that defeats the most beneficial point of star models, which is that you can get to all data directly instead of going through bridge tables.
You have the option of putting the survey type in its own dimension (or a junk dimension, perhaps) or having it accessed through the Survey dimension (not through a snowflake). That's typical with dimensional modeling - you don't need to follow your entities - you just need to avoid the too-many-dimensions and too-few-dimensions traps and watch the cardinality of your dimensions - especially if you've accidentally included some degenerate dimension like an invoice number, which changes with every fact and so needs to be stored in the fact table.
Actually, it's sometimes easier to do your star models by doing the typical joins in your 3NF which create typical flat reporting views and then simply taking those flat rows and turning them into stars. (That's how little relevance the entity-relationship model really has to the dimensional model). So you might join SurveyMain to SurveyTypes and SurveyFine on your current normalized keys and look at all the columns. This would be the basis for the SurveyFine fact table. Ditto for the other fact tables I identified. The shared stuff would be a candidate for shared dimensions. Entity is a good candidate for a conformed dimension (i.e. it's going to be shared between these survey models and other models related to your enterprise - like HR models or accounting models).

I would set up SurveyFines, SurveyTag and SurveyFiles fact tables; they are all different grains of facts and they all represent the lowest grain.
They would all have Date, Entity and Survey dimensions with them.
I would then set up pre-aggregated metric tables for those metrics which might need to combine all three facts.
If you would like me to elaborate feel free to ask. I'm in a bit of rush today.
(continuing...)
It would appear to me, that your users want to pivot the measurable data (number of files, date files were sent, sum of fines). They want to look at those metrics by attributes of the Survey. That's why I suggest a survey dimension.
Considering your comment below, I might then build a pre-aggregated metric table with these columns:
Date (the date I loaded the metric table)
SurveyDimID
EntityDimID
NumTagsAssigned
NumFilesRequested
NumFilesReceived
NumFines
TotalFines
etc...
I would load this table every day with the full set of active survey data from my fact tables. This allows the users to go back and forth through history to see how the surveys came in.
I suppose at some point the entire survey process is complete; at that point those records would no longer be included in the metric load. (They would remain in the facts.)
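A rough sketch of that daily load might look like the following. It is written in Python with sqlite3 as a stand-in database, keeps only a few of the metric columns listed above, and assumes a hypothetical isActive flag on the survey dimension to decide which surveys are still in progress. None of these table definitions come from the posted schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimSurvey (
    surveyDimID INTEGER PRIMARY KEY,
    entityDimID INTEGER NOT NULL,
    isActive    INTEGER NOT NULL          -- hypothetical flag: 1 = still in progress
);
CREATE TABLE factTag  (surveyDimID INTEGER NOT NULL);
CREATE TABLE factFine (surveyDimID INTEGER NOT NULL, fineAmt REAL NOT NULL);
CREATE TABLE surveyMetric (
    loadDate        TEXT    NOT NULL,     -- the date the metric table was loaded
    surveyDimID     INTEGER NOT NULL,
    entityDimID     INTEGER NOT NULL,
    NumTagsAssigned INTEGER NOT NULL,
    NumFines        INTEGER NOT NULL,
    TotalFines      REAL    NOT NULL,
    PRIMARY KEY (loadDate, surveyDimID)
);
""")

# Made-up sample data: survey 1 is active, survey 2 is complete.
conn.executemany("INSERT INTO dimSurvey VALUES (?, ?, ?)", [(1, 10, 1), (2, 10, 0)])
conn.executemany("INSERT INTO factTag VALUES (?)", [(1,), (1,), (1,), (2,)])
conn.executemany("INSERT INTO factFine VALUES (?, ?)", [(1, 500.0), (1, 250.0)])


def load_metrics(conn, load_date):
    """Snapshot the metrics of every active survey for one load date."""
    # Each fact table is aggregated in its own subquery, so joining
    # tags and fines cannot multiply the counts.
    conn.execute(
        """
        INSERT INTO surveyMetric
            (loadDate, surveyDimID, entityDimID, NumTagsAssigned, NumFines, TotalFines)
        SELECT ?, s.surveyDimID, s.entityDimID,
               (SELECT COUNT(*) FROM factTag AS t WHERE t.surveyDimID = s.surveyDimID),
               (SELECT COUNT(*) FROM factFine AS f WHERE f.surveyDimID = s.surveyDimID),
               (SELECT COALESCE(SUM(f.fineAmt), 0) FROM factFine AS f
                 WHERE f.surveyDimID = s.surveyDimID)
        FROM dimSurvey AS s
        WHERE s.isActive = 1
        """,
        (load_date,),
    )


load_metrics(conn, "2010-06-01")
conn.execute("INSERT INTO factTag VALUES (1)")   # a new tag arrives the next day
load_metrics(conn, "2010-06-02")

history = conn.execute(
    "SELECT loadDate, surveyDimID, NumTagsAssigned, NumFines, TotalFines "
    "FROM surveyMetric ORDER BY loadDate"
).fetchall()
print(history)
```

Each load adds one row per active survey, so comparing rows across load dates shows how a survey's numbers changed over time, while completed surveys simply stop appearing.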

Related

Unable to find column names in a FK constraint

I have created two tables in Snowflake.
create or replace TRANSIENT TABLE TESTPARENT (
COL1 NUMBER(38,0) NOT NULL,
COL2 VARCHAR(16777216) NOT NULL,
COL3 VARCHAR(16777216) NOT NULL,
constraint UNIQ_COL3 unique (COL3)
);
create or replace TRANSIENT TABLE TESTCHILD3 (
COL_A NUMBER(38,0) NOT NULL,
COL_B NUMBER(38,0) NOT NULL,
ABCDEF VARCHAR(16777216) NOT NULL,
constraint FKEY_1 foreign key (COL_A, COL_B) references TEST_DB.PUBLIC.TESTPARENT1(COL1,COL2),
constraint FKEY_2 foreign key (ABCDEF) references TEST_DB.PUBLIC.TESTPARENT(COL3)
);
Now I want to execute a query and see the names of the columns that are involved in the FKEY_2 FOREIGN KEY in table TESTCHILD3, but it seems like there is no DB table/view that keeps this information. I can find out the column names for UNIQUE KEY and PRIMARY KEY constraints, but there is nothing for FOREIGN KEYs.
EDIT
I have already tried INFORMATION_SCHEMA.TABLE_CONSTRAINTS, along with INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS and all the other system tables. No luck. Only DESC TABLE is giving me some info related to CONSTRAINTS and COLUMNS but that also has FOREIGN KEY CONSTRAINTS information missing.
SHOW IMPORTED KEYS IN TABLE <fk_table_name>;
Updated answer:
I was checking on something unrelated and noticed a very efficient way to list all primary and foreign keys:
show exported keys in account; -- Foreign keys
show primary keys in account;
When you limit the call to a table, it appears you have to request the foreign keys that point to the parent table:
show exported keys in table "DB_NAME"."SCHEMA_NAME"."PARENT_TABLE";
You can check the documentation for how to limit the show command to a specific database or schema, but this returns rich information in a table very quickly.
Maybe you can try querying this view: INFORMATION_SCHEMA.TABLE_CONSTRAINTS
Note: TABLE_CONSTRAINTS only displays objects for which the current role for the session has been granted access privileges.
For more see: https://docs.snowflake.net/manuals/sql-reference/info-schema/table_constraints.html

Is my ER-Diagram for yearly data on trade and transportation OK?

As part of my school project, I'm supposed to design/make a database where one can store/update/retrieve yearly data about international trade and transportation. To begin with, I isolated a small part of the database in order to start small.
Firstly I tried to design a diagram that would store the number of passengers (not individual passengers) that embarked/disembarked on/off ships in each port of every country every year and how many local and foreign passengers there were (I don't need those two to interact).
(Ignore the Passengers on the top.) The inwards_outwards entity would give me a table in the database that would look like this:
Secondly, I tried to design the diagram of a table where I could store Origin-Destination data (e.g. of the passengers that arrived in (or left from) a country, how many came from (went to) each other country, etc.).
For instance, in 2011, 20 passengers flew from England to France, 10 to Germany, etc., and in 2011, 23 passengers arrived in England from France, 19 from Germany, etc.
and the od_hellas entity would give me a table like this:
Questions:
Do the above look OK to you?
Is there a more efficient way to store yearly data?
Is what I'm trying to make doable in the context of a project? Any advice in general?
You can do this with three tables as shown below.
If you want to add data about Passengers then you would need a fourth table "Passenger"
The value in your "Numbers" column can be calculated from the base data by using SQL COUNT something like this:
SELECT COUNT(passengerNr)
FROM Departure
WHERE portCode = N'EL_OGRPIR';
To get the data by year, you just add something like [AND YEAR("date") = 2011] (depends on how you choose to store your date data.)
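To show the count-per-year idea end to end, here is a small runnable sketch. It uses Python's sqlite3 module instead of SQL Server, so the year is extracted with SQLite's strftime rather than YEAR(), and dates are stored as ISO strings. The sample rows and the second port code are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Departure (
        passengerNr     INTEGER NOT NULL,
        portCode        TEXT    NOT NULL,
        "date"          TEXT    NOT NULL,   -- ISO date string in this sketch
        isInternational INTEGER,
        PRIMARY KEY (passengerNr, portCode)
    )
""")

# Made-up sample rows; 'EL_OTHER' is a hypothetical second port.
conn.executemany("INSERT INTO Departure VALUES (?, ?, ?, ?)", [
    (1, "EL_OGRPIR", "2011-03-01", 1),
    (2, "EL_OGRPIR", "2011-07-15", 0),
    (3, "EL_OGRPIR", "2012-01-09", 1),
    (4, "EL_OTHER",  "2011-05-05", 1),
])


def departures_per_year(conn, port_code):
    """Return (year, passenger count) pairs for one port, oldest year first."""
    return conn.execute(
        """
        SELECT CAST(strftime('%Y', "date") AS INTEGER) AS yr,
               COUNT(passengerNr)
        FROM Departure
        WHERE portCode = ?
        GROUP BY yr
        ORDER BY yr
        """,
        (port_code,),
    ).fetchall()


piraeus_counts = departures_per_year(conn, "EL_OGRPIR")
print(piraeus_counts)
```

Grouping by year returns all the yearly totals in one query, so the "Numbers" column never has to be stored.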
Here is the logical view of the tables.
Here is the SQL DDL that you would use to generate the tables in a database. (e.g. you could cut and paste this SQL into the "New Query" panel in SQL Server Management Studio.)
CREATE SCHEMA Trade
GO
CREATE TABLE Trade.Port
(
portCode nchar(15) NOT NULL,
countryCode nchar(2) NOT NULL,
portName nchar(50) NOT NULL,
type nchar(10) CHECK (type IN (N'SeaPort', N'AirPort', N'LandBorder')) NOT NULL,
CONSTRAINT Port_PK PRIMARY KEY(portCode)
)
GO
CREATE TABLE Trade.Departure
(
passengerNr int NOT NULL,
portCode nchar(15) NOT NULL,
"date" datetime NOT NULL,
isInternational bit,
CONSTRAINT Departure_PK PRIMARY KEY(passengerNr, portCode)
)
GO
CREATE TABLE Trade.Arrival
(
passengerNr int NOT NULL,
portCode nchar(15) NOT NULL,
"date" datetime NOT NULL,
isInternational bit,
CONSTRAINT Arrival_PK PRIMARY KEY(passengerNr, portCode)
)
GO
ALTER TABLE Trade.Departure ADD CONSTRAINT Departure_FK FOREIGN KEY (portCode) REFERENCES Trade.Port (portCode) ON DELETE NO ACTION ON UPDATE NO ACTION
GO
ALTER TABLE Trade.Arrival ADD CONSTRAINT Arrival_FK FOREIGN KEY (portCode) REFERENCES Trade.Port (portCode) ON DELETE NO ACTION ON UPDATE NO ACTION
GO

Records updating in a weird order when saving to localdb - LINQ

My records update successfully in my controller, but View Data shows weird indexing behaviour.
Now, this would kinda make sense if my ID field wasn't a unique identity but... scripting to clipboard generates this:
USE [myDb]
GO
/****** Object: Table [dbo].[Contacts] Script Date: 10/25/2018 1:02:01 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Contacts] (
[Id] INT IDENTITY (1, 1) NOT NULL,
[Name] NVARCHAR (255) NOT NULL,
[ContactType] NVARCHAR (255) NOT NULL,
[ContactNumber] BIGINT NOT NULL,
[BirthDate] DATE NOT NULL
);
What am I missing here? This is a small project and it shouldn't really be an issue, because my page loads with the Contact ID and then updates that same ID in my controller... but it'd be nice to know why this is happening for the future.

How to create a derived attribute in DDL

CREATE TABLE Matches
(
mID INTEGER PRIMARY KEY,
date DATE NOT NULL,
location CHAR(25) NOT NULL,
teamA CHAR(15) NOT NULL,
goalsForA INTEGER,
pointsA INTEGER,
teamB CHAR(15) NOT NULL,
goalsForB INTEGER,
pointsB INTEGER,
/*
M1: The match number must be under 65
*/
CONSTRAINT M1 CHECK (mID < 65),
/*
M2: location must refer to stadiumName in the Locations.
*/
CONSTRAINT M2 FOREIGN KEY (location) REFERENCES Stadiums (stadiumName)
ON DELETE CASCADE,
/*
M3:
*/
CONSTRAINT M3 ()
)
Okay, so I need to make it so that pointsA and pointsB are calculated from goalsForA and goalsForB. If goalsForA = goalsForB, then pointsA and pointsB get 1 each. If goalsForA > goalsForB, then pointsA gets 3 added, and vice versa for B. My professor never taught us how to do this, and I can't find it anywhere.
As a general rule, don't store what you can calculate. Make a table without pointsA and pointsB, then make a view of the table with calculated columns.
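A minimal sketch of that approach is below, using Python's sqlite3 so it can be run as-is; the same CASE expressions should carry over to most SQL dialects. The view name MatchPoints is made up, the table is trimmed to the columns that matter here, and giving the losing team 0 points is an assumption, since the question only says what the winner and a draw receive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Matches (
    mID       INTEGER PRIMARY KEY CHECK (mID < 65),
    teamA     TEXT NOT NULL,
    goalsForA INTEGER,
    teamB     TEXT NOT NULL,
    goalsForB INTEGER
);

-- Points are derived on read instead of being stored.
-- With no ELSE branch, a match without goals recorded yields NULL points.
CREATE VIEW MatchPoints AS
SELECT mID, teamA, goalsForA,
       CASE WHEN goalsForA > goalsForB THEN 3
            WHEN goalsForA = goalsForB THEN 1
            WHEN goalsForA < goalsForB THEN 0   -- assumption: a loss earns 0
       END AS pointsA,
       teamB, goalsForB,
       CASE WHEN goalsForB > goalsForA THEN 3
            WHEN goalsForB = goalsForA THEN 1
            WHEN goalsForB < goalsForA THEN 0
       END AS pointsB
FROM Matches;
""")

# Made-up sample matches: a draw, a win for team A, and one not yet played.
conn.executemany("INSERT INTO Matches VALUES (?, ?, ?, ?, ?)", [
    (1, "A", 2, "B", 2),
    (2, "C", 3, "D", 1),
    (3, "E", None, "F", None),
])

points = conn.execute(
    "SELECT mID, pointsA, pointsB FROM MatchPoints ORDER BY mID"
).fetchall()
print(points)
```

Because the view recomputes the points whenever it is queried, correcting a score in Matches can never leave stale points behind.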

Entity Framework and self referencing table

I have a table that self references to create a hierarchy.
CREATE TABLE [dbo].[Topics](
[ID] [uniqueidentifier] NOT NULL,
[ParentTopicID] [uniqueidentifier] NULL,
[Name] [nvarchar](50) NOT NULL,
CONSTRAINT [PK_Topics] PRIMARY KEY CLUSTERED
([ID] ASC)
WITH (
PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Topics] WITH CHECK ADD CONSTRAINT [FK_Topics_Topics]
FOREIGN KEY([ParentTopicID]) REFERENCES [dbo].[Topics] ([ID])
For the "root" nodes, the ParentTopicID will be null, and children will point to appropriate TopicID.
This structure works in SQL but Entity Framework appears to be having problems with this. Even if I try a simple enumeration such as:
foreach(var t in container.Topics) {
Console.WriteLine(t);
}
I get an error:
The 'ParentTopicID' property on 'Topic' could not be set to a 'null'
value. You must set this property to a non-null value of type 'Guid'.
The second problem is to query this table to find the root nodes or children of a particular topic.
In SQL, it would be as simple as WHERE ParentTopicID IS NULL, but since Guid is not nullable in .NET, the LINQ syntax complains and doesn't find any matches.
Yes, the problem here is that your table allows NULL for ParentTopicID, but in the EF Designer you probably have ParentTopicID set to Nullable = false. Change that first, and we can go from there if it doesn't fix it.
In the designer, select the class, select ParentTopicID, press F4 for properties.

Resources