Data Warehouse Design of Fact Tables

I'm pretty new to data warehouse design and am struggling with how to design the fact table for metrics that are very similar but not identical. Say you were evaluating the metrics below (in this case, company is a subset of client): how would you break up the fact tables? Could one table serve all of this, would each metric warrant its own fact table, or would each part of the metric be its own column in a single fact table?
Total company daily/monthly/yearly # of files processed
Total company daily/monthly/yearly file sizes processed
Total company daily/monthly/yearly # files errored
Total company daily/monthly/yearly # files failed
Total client daily/monthly/yearly # of files processed
Total client daily/monthly/yearly file sizes processed
Total client daily/monthly/yearly # files errored
Total client daily/monthly/yearly # files failed

By the looks of the measure names, I think you'll be served by a single fact table with one record per file and a link back to a date_dim:
create table date_dim (
date_sk int,
calendar_date date,
month_ordinal int,
month_name nvarchar(20),
Year int
-- ..etc, you've got your own one of these..
)
create table fact_file_measures (
date_sk int, --ref the date_dim
file_sk int, --ref the file_dim with additional file info
company_sk int, --ref the company_dim with the client details
processed int, --should always be one, optional depending on how your reporting team like to work
size_Kb decimal, --specify a size measurement, ambiguity is bad
error_count int, --1 if file had error, 0 if fine
failed_count int --1 if file failed, 0 if fine
)
So now you should be able to construct queries to get everything you asked for.
For example, for your monthly stats:
select
c.company_name,
c.client_name,
sum(f.processed) total_files,
sum(f.size_Kb) total_files_size_Kb,
sum(f.error_count) total_files_errored,
sum(f.failed_count) total_files_failed
from
fact_file_measures f
inner join company_dim c on f.company_sk = c.company_sk
inner join date_dim d on f.date_sk = d.date_sk
where
d.month_name = 'January' and d.Year = 1984
group by
c.company_name,
c.client_name
If you need the side-by-side day/month/year figures, you can construct year and month fact tables to do the roll-ups and join back via date_dim's month and year fields. (You could include month and year fields in the daily fact table, but these values may end up being misused by less experienced report builders.) It all goes back to what your users actually want: design your fact tables to their requirements and don't be afraid to have separate fact tables. A data warehouse is not about normalization; it's about presenting the data in a way that it can be used.
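To make the day/month/year roll-ups concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not part of the original schema: the sample rows, the helper names build_warehouse and rollup, and the trimmed-down columns are assumptions for illustration only. It shows that one per-file fact table can feed all three grains simply by changing the GROUP BY.

```python
import sqlite3

def build_warehouse():
    """Create an in-memory SQLite database with a tiny date_dim and fact table."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table date_dim (
            date_sk int primary key,
            calendar_date text,
            month_ordinal int,
            year int
        );
        create table fact_file_measures (
            date_sk int,
            company_sk int,
            processed int,      -- always 1 per file
            size_kb real,
            error_count int,    -- 1 if the file had an error
            failed_count int    -- 1 if the file failed
        );
    """)
    # Made-up sample rows, for illustration only.
    con.executemany("insert into date_dim values (?, ?, ?, ?)", [
        (1, "1984-01-01", 1, 1984),
        (2, "1984-01-02", 1, 1984),
        (3, "1984-02-01", 2, 1984),
    ])
    con.executemany("insert into fact_file_measures values (?, ?, ?, ?, ?, ?)", [
        (1, 10, 1, 100.0, 0, 0),
        (1, 10, 1, 50.0, 1, 0),
        (2, 10, 1, 25.0, 0, 1),
        (3, 10, 1, 10.0, 0, 0),
    ])
    return con

def rollup(con, grain):
    """Aggregate the per-file facts to 'day', 'month' or 'year'."""
    keys = {
        "day": "d.calendar_date",
        "month": "d.year, d.month_ordinal",
        "year": "d.year",
    }[grain]
    sql = f"""
        select {keys},
               sum(f.processed), sum(f.size_kb),
               sum(f.error_count), sum(f.failed_count)
        from fact_file_measures f
        join date_dim d on f.date_sk = d.date_sk
        group by {keys}
        order by {keys}
    """
    return con.execute(sql).fetchall()

con = build_warehouse()
daily = rollup(con, "day")
monthly = rollup(con, "month")
yearly = rollup(con, "year")
```

Whether you materialise the monthly and yearly results as their own fact tables or compute them on demand like this is the design choice discussed above; the sketch only shows that the daily grain carries enough information for both.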
Good luck

Related

[Google Data Studio]: Can't create histogram as bin dimension is interpreted as metric

I would like to make a histogram of some data that is saved in a nested BigQuery table. In simplified form, the table can be created in the following way:
CREATE TEMP TABLE Sessions (
id int,
hits
ARRAY<
STRUCT<
pagename STRING>>
);
INSERT INTO Sessions (id, hits)
VALUES
( 1,[STRUCT('A'),STRUCT('A'),STRUCT('A')]),
( 2,[STRUCT('A')]),
( 3,[STRUCT('A'),STRUCT('A')]),
( 4,[STRUCT('A'),STRUCT('A')]),
( 5,[]),
( 6,[STRUCT('A')]),
( 7,[]),
( 8,[STRUCT('A')]),
( 9,[STRUCT('A')]),
(10,[STRUCT('A'),STRUCT('A')]);
and it looks like this:
id  hits.pagename
1   A
    A
    A
2   A
3   A
    A
and so on for the other ids. My goal is to obtain a histogram showing the distribution of A-occurrences per id in Data Studio. The report for the MWE can be seen here: link
So far I have created a calculated field called pageviews that counts the wanted occurrences for each session via SUM(IF(hits.pagename="A",1,0)). Looking at a table showing id and pageviews, I get the expected result:
table showing the number of occurrences of page A for each id
However, the output of the calculated field is a metric, which might cause trouble in what follows. In the next step I wanted to follow the procedure presented in this post. Therefore, I created another field, bin, to assign my sessions to bins according to pageviews:
CASE
WHEN pageviews = 0 OR pageviews = 1 THEN "bin 1"
WHEN pageviews = 2 OR pageviews = 3 THEN "bin 2"
END
According to this bin definition I hope to obtain a histogram with 6 counts in bin 1 and 4 counts in bin 2. Well, in this particular example bin 1 will actually have 4 counts, as ids 5 and 7 have "null" entries, but never mind; this won't happen in my real-world table.
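To pin down the expected numbers, here is a small sketch of the two steps (count per session, then bin) in plain Python rather than Data Studio. The names sessions, bin_label and histogram are only illustrative, and empty sessions are counted as 0 here, which is the intended result rather than the "null" behaviour described above.

```python
from collections import Counter

# Sessions from the INSERT statement above: id -> list of hit pagenames.
sessions = {
    1: ["A", "A", "A"], 2: ["A"], 3: ["A", "A"], 4: ["A", "A"], 5: [],
    6: ["A"], 7: [], 8: ["A"], 9: ["A"], 10: ["A", "A"],
}

def bin_label(pageviews):
    """Mirror the CASE expression: 0-1 -> 'bin 1', 2-3 -> 'bin 2', else None."""
    if pageviews in (0, 1):
        return "bin 1"
    if pageviews in (2, 3):
        return "bin 2"
    return None

# Step 1: per-session count of page "A" (the pageviews field).
pageviews = {sid: sum(1 for p in hits if p == "A") for sid, hits in sessions.items()}

# Step 2: number of sessions per bin (the histogram).
histogram = Counter(bin_label(n) for n in pageviews.values())
```

The point of the sketch is that the bin has to be computed per id before the ids are counted, which is exactly the step where Data Studio treats the field as a metric.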
As you can see in the next image, which shows the same table as above but now with a bin column, this assignment works as well: each id is assigned the correct bin, but now the output field is a metric of type text. Therefore, the bar chart won't let me use it (it needs a dimension).
Assignment of each id to a bin
Somewhere I read about a workaround: create a self-joined blend, which outputs metrics as dimensions. This works in name only: my field is now a dimension and I can use it as such for the bar chart, but the bar chart won't load and shows a data source configuration error, which can be seen in this picture:
bar chart of id count over bin. In the configuration of the chart one can see that "bin" is now a dimension. The chart won't plot, however, as it shows a data configuration error (sorry for the German version of Data Studio).

stack data and restructure without using var to cases or casestovar in SPSS

I have the following situation: a loop (stacked data) with only one index variable and with multiple items corresponding to the statements, as in the picture below (sorry, it is Excel, but it is the same as in SPSS):
stack data - cases on multiple lines, but never filling for 1 respondent all the columns
I want to reach the following situation, but without using CASESTOVARS to restructure, because that creates a lot of empty variables. I remember that in older versions there was a command like UPDATE, which moved the cases up, to reach the following result:
reducing the cases per respondent
Like starting from this:
ID Index Q1_1 Q1_2 Q1_3 Q1_4 Q1_5 Q1_6
1 1 1 1
1 2 1 1
1 3 1 1
To reach this:
ID Q1_1 Q1_2 Q1_3 Q1_4 Q1_5 Q1_6
1 1 1 1 1 1 1
But without using casestovars. Is there any command in SPSS syntax for this?
Thank you very much, have a nice day!

Not entirely sure how variable your data structure is likely to be in reality, but if, as demoed, you have only a single response for each of q1_1 to q1_6 per respondent ID, then the below would be sufficient:
dataset declare dsAgg.
aggregate outfile="dsAgg" /break=respid /q1_1 to q1_6=max(q1_1 to q1_6).
Also not sure of the significance of the duplicate index values within the same respondent IDs, or whether this was intended.
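For readers without SPSS at hand, the same idea (one output row per respondent, taking the maximum non-missing value of each item) can be sketched in Python. This is only an illustration of the logic, not SPSS syntax; the function name collapse_by_max and the sample rows are made up, with None standing in for a missing value.

```python
def collapse_by_max(rows, id_key="respid", skip=("index",)):
    """Collapse stacked rows to one row per ID, keeping the max non-missing value per item."""
    merged = {}
    for row in rows:
        target = merged.setdefault(row[id_key], {id_key: row[id_key]})
        for name, value in row.items():
            if name == id_key or name in skip:
                continue
            current = target.get(name)
            # Keep the larger value; a missing value never overwrites a real one.
            if value is not None and (current is None or value > current):
                current = value
            target[name] = current
    return list(merged.values())

# Made-up stacked rows: each row fills only one of the items.
rows = [
    {"respid": 1, "index": 1, "q1_1": 1, "q1_2": None},
    {"respid": 1, "index": 2, "q1_1": None, "q1_2": 2},
    {"respid": 2, "index": 1, "q1_1": 3, "q1_2": None},
    {"respid": 2, "index": 1, "q1_1": None, "q1_2": 4},
]
collapsed = collapse_by_max(rows)
```

As with the AGGREGATE approach, this only gives the intended result when each item has at most one response per respondent; with several responses the maximum silently wins.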

The following syntax could do the job -
* first we'll recreate your example data.
data list list/respid index q1_1 to q1_6.
begin data
1,1,1,,,,,
1,2,,2,,,,
1,3,,,1,,,
1,4,,,,2,,
1,5,,,,,1,
1,6,,,,,,2
2,1,3,,,,,
2,1,,4,,,,
2,2,,,5,,,
2,2,,,,4,,
2,3,,,,,3,
2,3,,,,,,2
end data.
* now to work: first thing is to make sure the data from each ID are together.
sort cases by respid index.
* the loop will fill down the data to the last line of each ID.
do repeat qq=q1_1 to q1_6.
if respid=lag(respid) and missing(qq) qq=lag(qq).
end repeat.
* the following lines will help recognize the last line for each ID and select it.
compute lineNR=$casenum.
aggregate /outfile=* mode=ADDVARIABLES/break=respid/MXlineNR=max(lineNR).
select if lineNR=MXlineNR.
exe.

SUM(LAST()) on GROUP BY

I have a series, disk, that contains a path (/mnt/disk1, /mnt/disk2, etc.) and the total space of a disk. It also includes free and used values. These values are updated at a specified interval. What I would like to do is query for the sum of the last() total of each path. I would also like to do the same for free and for used, to get an aggregate of the total size, free space, and used space of all of the disks on my server.
I have a query here that will get me the last(total) of all the disks, grouped by its path (for distinction):
select last(total) as total from disk where path =~ /(mnt\/disk).*/ group by path
Currently, this returns 5 series, each containing 1 row (the latest) and the value of its total. I then want to take the sum of those series, but I cannot just wrap the last(total) into a sum() function call. Is there a way to do this that I am missing?

Carrying on from my comment above about nested functions.
Building a toy example:
CREATE DATABASE FOO
USE FOO
Assuming your data is updated at intervals greater than[1] every minute:
CREATE CONTINUOUS QUERY disk_sum_total ON FOO
BEGIN
SELECT sum("total") AS "total_1m" INTO disk_1m_total FROM "disk"
GROUP BY time(1m)
END
Then push some values in:
INSERT disk,path="/mnt/disk1" total=30
INSERT disk,path="/mnt/disk2" total=32
INSERT disk,path="/mnt/disk3" total=33
And wait more than a minute. Then:
INSERT disk,path="/mnt/disk1" total=41
INSERT disk,path="/mnt/disk2" total=42
INSERT disk,path="/mnt/disk3" total=43
And wait a minute+ again. Then:
SELECT * FROM disk_1m_total
name: disk_1m_total
-------------------
time total_1m
1476015300000000000 95
1476015420000000000 126
The two values are 30+32+33=95 and 41+42+43=126.
From there, it's trivial to query:
SELECT last(total_1m) FROM disk_1m_total
name: disk_1m_total
-------------------
time last
1476015420000000000 126
Hope that helps.
[1] Picking intervals smaller than the update frequency prevents minor timing jitter from causing data to be accidentally summed twice for a given group. There might be some "zero update" intervals, but no "double counting" intervals. I typically run the query twice as fast as the updates. If the CQ sees no data for a window, nothing is written for that window, so last() will still give the correct answer. For example, I left the CQ running overnight and pushed no new data in: last(total_1m) gives the same answer, not zero for "no new data".
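As a cross-check of the numbers above, the quantity the question asks for (the sum over paths of each path's latest total) can be sketched in plain Python, outside InfluxDB. The function name sum_of_last and the integer timestamps are assumptions for illustration; the totals are the ones inserted in the toy example.

```python
def sum_of_last(points, field="total"):
    """Sum the most recent `field` value of each path."""
    latest = {}
    for point in points:
        seen = latest.get(point["path"])
        # Keep the point with the highest timestamp for each path.
        if seen is None or point["time"] >= seen["time"]:
            latest[point["path"]] = point
    return sum(p[field] for p in latest.values())

points = [
    {"time": 1, "path": "/mnt/disk1", "total": 30},
    {"time": 1, "path": "/mnt/disk2", "total": 32},
    {"time": 1, "path": "/mnt/disk3", "total": 33},
    {"time": 2, "path": "/mnt/disk1", "total": 41},
    {"time": 2, "path": "/mnt/disk2", "total": 42},
    {"time": 2, "path": "/mnt/disk3", "total": 43},
]
result = sum_of_last(points)
```

Here result is 41+42+43=126, the same value the continuous query reports for the latest window.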

Return only results based on current object for dynamic menus

If I have an object with a has_many association, how would I get back only the records whose ids are actually referenced by the associated records?
Example:
tier_tbl
| id | name |
| 1  | low  |
| 2  | med  |
| 3  | high |
randomdata_tbl
| id | tier_id | name |
| 1  | 1       | xxx  |
| 2  | 1       | yyy  |
| 3  | 2       | zzz  |
I would like to build a query that returns only, in the case of the above example, rows 1 and 2 from tier_tbl, because only 1 and 2 exist in the tier_id data.
I'm new to ActiveRecord and, without a loop, don't know a good way of doing this. Does Rails allow for this kind of query building in an easier way?
The reasoning behind this is so that I can list only menu items that relate to the specific object I am dealing with. If the object I am dealing with has only the items contained in randomdata_tbl, there is no reason to display the third tier name, so I'd like to omit it completely. I need to go in this direction because of the way the models are set up. The example I'm dealing with is slightly more complicated.
Thanks

Let's call your first table tiers and your second table randoms.
If a tier has many randoms and you want to find all tiers whose id is present in table randoms, you can do it this way:
# database query only (.distinct is the Rails 4+ name; older versions call it .uniq)
Tier.joins(:randoms).distinct
or
# with some Ruby code (loads all tiers, then checks each one in Ruby)
Tier.select{ |t| t.randoms.any? }
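To see what the database-only variant boils down to, here is a sketch of an equivalent inner join with DISTINCT, run through Python's built-in sqlite3 module. The SQL is an assumption about the general shape of such a query, not a dump of what ActiveRecord generates, and the table names tiers and randoms follow the renaming above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table tiers (id integer primary key, name text);
    create table randoms (id integer primary key, tier_id integer, name text);
    insert into tiers values (1, 'low'), (2, 'med'), (3, 'high');
    insert into randoms values (1, 1, 'xxx'), (2, 1, 'yyy'), (3, 2, 'zzz');
""")

# The inner join drops tiers with no randoms; DISTINCT removes the duplicate
# 'low' row produced by its two matching randoms.
used_tiers = con.execute("""
    select distinct tiers.id, tiers.name
    from tiers
    inner join randoms on randoms.tier_id = tiers.id
    order by tiers.id
""").fetchall()
```

With the example data this leaves tiers 1 and 2 and omits 'high', which is the menu behaviour the question asks for.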

Best Data Access approach for asp.net mvc application with big data

We have a big challenge: our application is too slow because of the huge number of records.
We are looking for the best data access pattern for a web application (ASP.NET MVC 3) with big data (a SQL Server view with more than 66 million records).
Customers have forms with search fields and see the results in a grid on a web page.
So far we have tried these data access approaches:
1- Entity Framework with some features like compiled queries
// get result
GridModelsResult = VwLoanInquiryList.Skip(pageIndex * pageSize).Take(pageSize).ToList();
// get count (double rather than float, since float cannot represent every integer at this record count)
int totalPages = (int)Math.Ceiling((double)totalRecords / pageSize);
2- SQL Server stored procedure for pagination, returning only the current page of results
--get result
select t1.* into #t from
(select *
FROM VwLoanInquiry
where CustomerNumber like '%'+@CustomerNumber+'%'
) AS [t1]
WHERE (@PageIndex=-1 or (t1.ID BETWEEN (@PageIndex-1)* @RecordPerPage AND ((@PageIndex)* @RecordPerPage)-1))
--get count
select count(*) from #t
3- OLAP: we built a cube in Analysis Services (SSAS),
but our problem there is showing records with string columns within cube cells (it is impossible to define non-numeric fields as measures).
In addition, to get results faster, our view is indexed and the column data types have been set.
Please help us understand which of these approaches is better, or, if there is another way, what it is.
Thanks
