Should the "count" measure be stored in the fact table?

Should the "count" measure be stored in the fact table? - data-warehouse

I have a fact table that includes "wait times in hours" for certain services. I have a lot of dimensions that could describe the wait-times based on different slices; however, I am also interested in knowing how many people (counts) came for services through the filters of the same dimensions.
Given the dimensions for both the wait-times in hours and the number of people who got services are exactly the same, I think it's best practice to keep it in the same fact table. My question is:
Should there be a different fact table for the count measure mentioned?
How would I include this measure? Do I just put 1 in every single row? Because regardless of the wait-time, they've gotten the service only once (you cannot go above/below 1 in my scenario).

1) Think about the grain of your existing fact table. It sounds like it's probably "an occasion on which a person received a service." If that's the same thing you're trying to count, then yes - the waiting time and the count are the same grain.
However, while they may well be the same grain, there might be no need to add anything to the table. Read point 2 for an explanation.
2) You could put a 1 in a column on every row, but I'm not sure what you'd gain from it. You've not said what tools will be consuming this data, but you should be able to do a count/distinct count of some kind.
Working on the basis that you've tagged SSIS so are likely using Microsoft's BI stack:
TSQL has count(), and you can do count(distinct [column]).
SSAS has both counts and distinct counts as aggregation types.
MDX offers several different types of count.
SSRS has Count, CountDistinct, and CountRows.
Whether you do a normal count or a distinct count will depend on whether you're trying to ask "How many people used this service?" or "How many different people used this service?"

Related

DWH SCD type 2 implementation in SQL Server scd2 and scd1

We are implementing a new dwh solution. I have many dimensions that require slowly changing type 2 attributes. I was considering implementing a combination of Type 2 and Type 1 attributes in my dimension. That is for some dimension attributes, we track history by inserting new rows in the dim table (Type2), for other attributes we will just update the existing row for any changes (Type1)
Questions:
Is this a good practice? is it OK to have a combination of SCD 1 and 2 for the same dim?
Is there any limit on the number of SCD 2 attributes in a dimension? My dimension is pretty wide, like 300 cols, out of which one of the users is requesting for about 150 cols to be tracked by scd type 2. Is it OK to have so many scd2 attributes in a dim? Is there going to be any impact on performance of downstream reporting BI solutions like cubes and dashboards because of this?
In the OLTP system, we maintain an "audit" table to log any updates. Though this is not in a very easily queryable format, we get answers to most of our questions related to changes from this. We don't need much reporting on data changes. Of course there are some important columns like Status for which we definitely need SCD2 but rest of the columns, I am not sure having history for lot of other columns in the DWH adds any value. My question is when we have this audit table in OLTP, how do I decide what attributes need SCD 2 in the DWH?

Good practice? Yes. Standard feature of dimensional modelling that is overlooked too often. I've seen dimensions with combinations of SCD0, SCD1 and SCD2, and there's nothing to prevent other SCD-types being used as well.
No limit on columns, but that does seem a little excessive. You probably want to use a "hash" method to detect the SCD2 changes, where you calculate a hash over the SCD2 columns, and use this value to detect if any of the columns have changed.
Sorry, but I don't understand the question about audit logs. Are these logs your data source?

How does one perform a "range query"?

Google cloud dataflow supports what I would call a "full outer join" SQL like statement through their "CoGroupByKey"method. However, is there any way to implement in dataflow what would be in SQL a "range join"? For example, if I had a table called "people" in which there was a floating point field called "age". And let's say I wanted all the pairs of people in which their ages were within say five years from each other. I could write the following statement:
select p1.name, p1.age, p2.name, p2.age
from people p1, people p2
where p1.age between (p2.age - 5.0) and (p2.age + 5.0);
I couldn't determine if there was a way to accomplish this in dataflow. (Again, if I wanted a strict equality, that I could use a CoGroupByKey, but in this case it's not a strict equality condition).
For my particular use case, the "people" table is not too large – maybe 500,000 rows and approximately 50 megs of RAM required. So, I could, I think, simply run a asList() method to create a single object that sits in a single computer's RAM and then just sort the people object by age and then write some sort of routine that "walks through the list from the low stage to the highest age" and while walking through the list outputs those people whose ages are less than 10 years from each other. This would work, but it would be single threaded etc. I was wondering if there was a "better" way of doing it using the dataflow architecture. (And other developers may need to find a "dataflow" way of doing this operation if the object that they were dealing with dies not fit nicely into memory of one single computer, e.g. a people table of maybe 1 billion rows etc.)

The trick to making this work efficiently at scale is to partition your data into sets of potential matches. In your case, you could assign each person to two different keys, age rounded up to multiple of 5, and age rounded down to multiple of 5. Then, do a GroupByKey on these buckets, and emit all the pairs within each bucket that are actually close enough in age. You'll need to eliminate duplicates, since it's possible for two records to both end up in the same two buckets.
With this solution, the entire data does not need to fit in memory, just each subset of the data.

Fact table design guidance for 100s of facts

I'm trying to create a datamart for the healthcare application. The facts in the datamart are basically going to be measurements and findings related to heart, and we have 100s of them. Starting from 1000 and can go to as big as 20000 per exam type.
I'm wondering what my design choices for the fact tables are:
Grain: 1 row per patient per exam type.
Some of the choices that I can think of -
1) A big wide fact table with 1000 or more columns.
2) EAV based design - A separate Measure dimension table. This foreign key will go into the fact table and the measure value will be in fact table. So the grain of the fact table will be changed to 1 row per patient per exam type per measurement.
3) Create smaller multiple fact tables per exam type per some other criteria like subgroup. But the end user is going to query across subgroups for that exam type and fact-fact join is not recommended.
4) any other ideas?
Any inputs would be appreciated.

1. A big wide fact table with 1000 or more columns.
One very wide fact table gives end-user maximum flexibility if queries are executed directly in the data warehouse. However some considerations should be taken into account, as you might hit some limits depending on a platform.
SQL Server 2014 limits are as per below:
Bytes per row 8,060. A row-overflow storage might be a solution, however it supports only few column types typically not related to fact nature, i.e. varchar, nvarchar, varbinary, sql_variant. Also not supported in In-Memory OLTP. https://technet.microsoft.com/en-us/library/ms186981(v=sql.105).aspx
Columns per non-wide table 1024. Wide-tables and sparse columns are solution as columns per wide table limit is 30,000. However the same Bytes per row limit applies. https://technet.microsoft.com/en-us/library/cc280604(v=sql.120).aspx
Columns per SELECT/INSERT/UPDATE statement 4,096
Non-clustered indexes per table 999
https://technet.microsoft.com/en-us/library/ms143432(v=sql.120).aspx
2. EAV based design - A separate Measure dimension table. This foreign key will go into the fact table and the measure value will be in fact table. So the grain of the fact table will be changed to 1 row per patient per exam type per measurement.
According to Kimball, EAV design is called Fact Normalization. It may make sense when a number of measurements is extremely lengthy, but sparsely populated for a given fact and no computations are made between facts.
Because facts are normalized therefore:
Extensibility is very easy, i.e. it's easy to add new measurements without the need to amend the data structure.
It's good to extract all measurements for one exam and present measurements as rows on the screen.
It's hard to extract/aggregate/make computation between several measurements (e.g. average HDL to CHOL ration) and present measurements/aggregates/computations as columns, i.e. requires complex WHERE/PIVOTING or multi-joins. SQL makes it difficult to make computations between facts in different rows.
If primary end-user platform is an OLAP cube then Fact Normalization makes sense. The cubes allows to make computation across any dimension.
Data importing could be an issue if data format is in a flat style CSV.
This questions is also discussed here Should I use EAV model?.
3) Create smaller multiple fact tables per exam type per some other criteria like subgroup. But the end user is going to query across subgroups for that exam type and fact-fact join is not recommended.
In some scenarios multiple smaller fact tables perfectly makes sense. One of the reason is if you hit some physical limits set by platform, e.g. Bytes per row.
The facts could be grouped either by subject area, e.g. measurement group/subgroup, or by frequency of usage. Each table could be placed on a separate file group and drive to maximize I/O.
Further, you could duplicate measurements across different fact tables to reduce the need of fact tables join, i.e. put one measurement in a specific measurement subgroup fact table and in frequently used measurement fact table.
However some considerations should be taken into account if there are some specific requirements for data loading. For example, if a record errors out in your ETL to one fact table, you might want to make sure that the corresponding records in the other fact tables are deleted and staged to your error table so you don't end up with any bogus information. This is especially true if end users have their own calculations in the front end tool.
If you use OLAP cubes then multiple fact tables actually becomes a source of a measure group to a specific fact table.
In terms of fact-to-fact join, you (BI application) should never issue SQL that joins two fact tables together across the fact table’s foreign keys. Instead, the technique of Drilling Across two fact tables should be used, where the answer sets from two or more fact tables are separately created, and the results sort-merged on the common row header attribute values to produce the correct result.
More on this topic: http://www.kimballgroup.com/2003/04/the-soul-of-the-data-warehouse-part-two-drilling-across/
4) any other ideas?
SQL XML or some kind NoSQL could be an option, but the same querying / aggregation / computation / presentation issues exist.

How to efficiently fetch n most recent rows with GROUP BY in sqlite?

I have a table of event results, and I need to fetch the most recent n events per player for a given list of players.
This is on iOS so it needs to be fast. I've looked at a lot of top-n-per-group solutions that use subqueries or joins, but these run slow for my 100k row dataset even on a macbook pro. So far my dumb solution, since I will only run this with a maximum of 6 players, is to do 6 separate queries. It isn't terribly slow, but there has to be a better way, right? Here's the gist of what I'm doing now:
results_by_pid = {}
player_ids = [1,2,3,4,5,6]
n_results = 6
for pid in player_ids:
results_by_pid[pid] = exec_sql("SELECT *
FROM results
WHERE player_id = #{pid}
ORDER BY event_date DESC
LIMIT n_events")
And then I go on my merry way. But how can I turn this into a single fast query?

There is no better way.
SQL window functions, which might help, are not implemented in SQLite.
SQLite is designed as an embedded database where most of the logic stays in the application.
In contrast to client/server databases where network communication should be avoided, there is no performance disadvantage to mixing SQL commands and program logic.
A less dumb solution requires you to do some SELECT player_id FROM somewhere beforehand, which should be no trouble.
To make the individual queries efficient, ensure you have one index on the two columns player_id and event_date.

This won't be much of an answer, but here goes...
I have found that making things really quick can involve ideas from the nature of the data and schema themselves. For example, searching an ordered list is faster than searching an unordered list, but you have to pay a cost up front - both in design and execution.
So ask yourself if there are any natural partitions on your data that may reduce the number of records SQLite must search. You might ask whether the latest n events fall within a particular time period. Will they all be from the last seven days? The last month? If so then you can construct the query to rule out whole chunks of data before performing more complex searches.
Also, if you just can't get the thing to work quickly, you can consider UX trickery! Soooooo many engineers don't get clever with their UX. Will your query be run as the result of a view controller push? Then set the thing going in a background thread from the PREVIOUS view controller, and let it work while iOS animates. How long does a push animation take? .2 seconds? At what point does your user indicate to the app (via some UX control) which playerids are going to be queried? As soon as he touches that button or TVCell, you can prefetch some data. So if the total work you have to do is O(n log n), that means you can probably break it up into O(n) and O(log n) pieces.
Just some thoughts while I avoid doing my own hard work.
More thoughts
How about a separate table that contains the ids of the previous n inserts? You could add a trigger to delete old ids if the size of the table grows above n. Say..
CREATE TABLE IF NOT EXISTS recent_results
(result_id INTEGER PRIMARY KEY, event_date DATE);
// is DATE a type? I don't know. you get the point
CREATE TRIGGER IF NOT EXISTS optimizer
AFTER INSERT ON recent_results
WHEN (SELECT COUNT(*) FROM recent_results) > N
BEGIN
DELETE FROM recent_results
WHERE result_id = (SELECT result_id
FROM recent_results
WHERE event_date = MIN(event_date));
// or something like that. I have no idea if this will work,
// I just threw it together.
Or you could just create a temporary memory-based table that you populate at app load and keep up to date as you perform transactions during app execution. That way you only pay the steep price once!
Just a few more thoughts for you. Be creative, and remember that you can usually define what you want as a data structure as well as an algorithm. Good luck!

Can 2 Cubes in a Data Warehouse be directly compared against each other?

Is there a way to compare all information (aggregates, down to the detail level) between two OLAP cubes? For example, say I wanted to compare one cube created to work with sql server 2000 to that same cube, but migrated to run on sql server 2005/2008 - technically they should both return the same information for all dimension / measure combinations but I need a way to verify.
I am definitely NOT a developer, but I do have access to enterprise manager, and potentially SAS tools etc. and I know a bit of SQL but not much else. I know that you can compare two dimensional (i.e. tables) data sets with sql queries, and also with SAS - but I have never heard of a way to compare three dimensional cubes.
Am I out of luck on this one? The last thing that I want to have to do is view both cubes and compare all possible results side by side via excel or something, I hope that it can be automated somehow.

Comparing cubes means doing enough "slice-and-dice" queries to prove that you've queried all of the facts.
You can, simply, get a sum and count of the various fact and dimension tables. If those are the same, odds are good that any particular query will be the same between the two.
Without details on the dimensions and facts in question, it's hard to make a more specific recommendation.
However, consider that you can easily compute a set of subtotals for each dimension of the cube. If the dimensions are the same number of rows, the results will be the same number of rows. If the grand total is the same, then all that's left is row-by-row comparison of the subtotals.
If you do this once for each dimension, you should have some confidence that they're the same. Or, you'll find a difference that you can explore with more detailed queries.

The best approach is to compare the cube data by interchanging the rows and columns and verifying if all the counts and totals match properly.
For example, if you are having year-wise totals for a particular location, it would be a good approach to interchange the values between locations and the months and verifying whether they match properly.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart