Data warehouse rollup and grouping sets, which to use?

I have learned ROLLUP, CUBE and GROUPING SETS, but one thing that confuses me is how to know which one to use. For example, if I need to find the sales for each month in 1996 by region and by manager, the two candidate queries are
SELECT month, region, sales_mgr, SUM(price)
FROM Sales
WHERE year = 1996
GROUP BY GROUPING SETS((month, region),(month, sales_mgr))
and
SELECT month, region, sales_mgr, SUM(price)
FROM Sales
WHERE year = 1996
GROUP BY ROLLUP(month, region, sales_mgr)
I know the result of each one, but I don't know which to use to answer the question properly. Is there something I missed, or are both considered correct?

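ROLLUP and CUBE are just shorthand for two common usages of GROUPING SETS: ROLLUP(a, b, c) is equivalent to GROUPING SETS((a, b, c), (a, b), (a), ()), and CUBE(a, b, c) is equivalent to GROUPING SETS over every subset of (a, b, c).
GROUPING SETS gives more precise control over which aggregations you want to calculate. For monthly sales "by region and by manager", the GROUPING SETS query returns exactly those two groupings, whereas ROLLUP(month, region, sales_mgr) returns a hierarchy that never contains a (month, sales_mgr) total.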
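
To see which groupings each shorthand stands for, here is a minimal Python sketch that expands ROLLUP and CUBE into explicit grouping sets. The helper names `rollup` and `cube` are made up for illustration and are not part of any SQL library; the expansion rules themselves follow the standard definitions.

```python
from itertools import combinations


def rollup(*cols):
    """Grouping sets implied by ROLLUP(cols): every prefix, longest first,
    ending with the empty set (the grand total)."""
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]


def cube(*cols):
    """Grouping sets implied by CUBE(cols): every subset of the columns."""
    return [subset
            for n in range(len(cols), -1, -1)
            for subset in combinations(cols, n)]


rollup_sets = rollup("month", "region", "sales_mgr")
cube_sets = cube("month", "region", "sales_mgr")

# The two groupings the question actually asks for.
wanted = [("month", "region"), ("month", "sales_mgr")]
missing_from_rollup = [w for w in wanted if w not in rollup_sets]
```

Here `missing_from_rollup` contains `("month", "sales_mgr")`, which is why the ROLLUP query cannot answer the "by manager" half of the question, while CUBE would cover both groupings at the cost of six extra ones.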

Related

How to Model Date Dimensions with Fact Tables of Different Grains

We have some use cases for our DW where we have fact tables at different grains - e.g., sales by store by day (fact 1) and sales budget targets by month (fact 2). They both involve Date as a grain, but in one case the grain is day and the other the grain is period.
Assuming we can't in the near term change the grain, what's the right way to model this?
A Date and a Month dimension, which will have conformed attributes?
1 Date dimension, with nulls or flags or something when it's representing a higher value (e.g., month)
Something else?

You only need one date dimension with one row per day. Just link to the last day of your period.
E.g. for a monthly aggregated fact just link to the last day of the month in your date dimension.
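
As a rough sketch of "link to the last day of the month", the following Python computes the date key that a monthly fact row would point at. The integer YYYYMMDD key format and the function name `month_end_date_key` are assumptions for illustration; your date dimension may use a different surrogate key.

```python
import calendar
from datetime import date


def month_end_date_key(year: int, month: int) -> int:
    """Return a YYYYMMDD integer key (assumed key format) for the last
    calendar day of the given month."""
    last_day = calendar.monthrange(year, month)[1]  # number of days in the month
    return int(date(year, month, last_day).strftime("%Y%m%d"))


# A monthly budget row for February 2024 would join to this date key.
feb_key = month_end_date_key(2024, 2)
```

Leap years are handled by `calendar.monthrange`, so the monthly fact always lands on a day that exists in the date dimension.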

Two different dimensions, one for Date and one for Month

How to compare Individual data to aggregate group by in Google Query (SQL)?

I have a table like this, with the current table in blue, and the desired results highlighted in yellow:
My goal is to set up a query in Google Sheets using the built-in =QUERY() function (note: it is based on Google's own Query Language, which is very similar to SQL) that can essentially produce this entire table without adding extra formulas. I know how to find the monthly averages separately, in a style like
SELECT month(DateRun), average(metric) GROUP BY month(DateRun)
But how could you have it so it's like
SELECT AdID, DateRun, Metric, average(Metric for Associated Month), IndividualMetric - AverageForMonth
I have tried to find it on my own, but have not been able to find a resource that I'm able to transform for my own usage.
I learned sub-queries a while back, and have a feeling that they may be the answer to this, but I am very lost.
Please let me know if I can provide any additional information.

try:
=ARRAYFORMULA(IFNA(VLOOKUP(MONTH(B2:B),
QUERY(B2:C, "select month(B)+1,avg(C) group by month(B)"), 2, 0)))
and:
=ARRAYFORMULA(IF(B2:B="",,C2:C-D2:D))
UPDATE:

You seem to be looking for a window average:
select
    AdID,
    DateRun,
    Metric,
    avg(Metric) over(
        partition by date_trunc(DateRun, month)
    ) month_avg_metric,
    Metric - avg(Metric) over(
        partition by date_trunc(DateRun, month)
    ) diff_with_month_avg_metric
from mytable
The partition is the month alone, so every row in a month shares the same average. The second-to-last column in the result set gives you the average of the metric for the row's month, and the last column computes the difference between the current metric and its monthly average.
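
For anyone who wants to check the numbers by hand, here is a small Python sketch of the same idea, assuming the average is taken over all rows that share a calendar month. The sample rows are invented for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical sample rows: (AdID, DateRun, Metric).
rows = [
    ("ad-1", date(2020, 1, 5), 10.0),
    ("ad-2", date(2020, 1, 20), 20.0),
    ("ad-3", date(2020, 2, 3), 30.0),
    ("ad-4", date(2020, 2, 17), 50.0),
]

# First pass: collect the metrics per (year, month) and average them.
metrics_by_month = defaultdict(list)
for _, run, metric in rows:
    metrics_by_month[(run.year, run.month)].append(metric)
month_avg = {month: mean(values) for month, values in metrics_by_month.items()}

# Second pass: attach the monthly average and the difference to every row.
result = [
    (ad, run, metric,
     month_avg[(run.year, run.month)],
     metric - month_avg[(run.year, run.month)])
    for ad, run, metric in rows
]
```

Each tuple in `result` keeps the original row and adds the monthly average plus the row's deviation from it, which is the shape of the desired table.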

InfluxDB and Grafana graph using midnight as 0 on Y-axis derivative

I am graphing with Grafana (2.6.0) and I have an InfluxDB (0.10.2) database with the following data in it:
> select * from "WattmeterMainskwh" where time > now() - 5m
name: WattmeterMainskwh
-----------------------
time value
1457579891000000000 15529.322
1457579956000000000 15529.411
1457580011000000000 15529.425
1457580072000000000 15529.460
1457580135000000000 15529.476
...etc...
This data collects my household kilowatt usage as measured by a kWH gauge that steadily increments the usage value across months or years. I cannot easily reset the counter, nor do I wish to do so.
My goal is to create a graph that shows my daily kWH use over 24 hour periods starting at midnight, or at a minimum showing relative kWH over the interval displayed. This type of graph would be useful in many other circumstances as well where I could imagine "errors across the day" or "visitors since opening time" or "BGP resets per calendar week" were useful but the collection counter was not reset to zero upon the reset or turn-over of the time interval. This kind of counting is actually quite common in my experience.
This graph works, but doesn't show me what I'm looking for:
SELECT derivative(mean("value")) FROM "WattmeterMainskwh" WHERE $timeFilter GROUP BY time($interval) fill(null)
That graph just shows the difference between one sample and the previous sample. What I want is a steadily increasing line starting from the left side of the graph and increasing towards the right side of the graph, with zero as the bottom of the Y axis, and the graph starting at zero at the farthest left X value.
This graph works too and shows me the correct curve, but it's off by fifteen thousand or so. So far, it's the closest to what I want but since this is an ever-increasing counter that can't be reset I need to subtract some from the Y axis. Ideally, I'd like to subtract whatever the value was at the previous midnight from each sample to get a relative number based on a day instead of an absolute based on all time.
SELECT sum("value") FROM "WattmeterMainskwh" WHERE $timeFilter GROUP BY time($interval) fill(null)
And here's the graph from that previous statement:
[Image: graph that is off by 15k]
This attempt didn't work - I apparently can't take a sum of a derivative group:
SELECT sum(derivative(mean("value"))) FROM "WattmeterMainskwh" WHERE $timeFilter GROUP BY time($interval) fill(null)
This doesn't work, either - I can't perform functions within "derivative":
SELECT derivative(sum("value")-first("value")) FROM "WattmeterMainskwh" WHERE $timeFilter GROUP BY time($interval) fill(null)
Of course, I could just create a new value that had calculations applied to it before I wrote it into InfluxDB, but that seems to me to be a data-redundant and sloppy way to solve this problem, as well as being quite inflexible if I want to look at other intervals on a whim. I'm hoping that there is some way to do this more elegantly within the combination of InfluxDB & Grafana, but I'm just not able to find it with the search terms I've used or the thinking I've put towards interpreting the documentation.
Is this type of graph even possible with InfluxDB/Grafana? As far as I can tell a continuous query is not a solution, and the lack of nested SELECTs makes even the hackish ways of doing this not obvious to me.
BONUS: It would be really great to have the graph show midnight every night as a "zero" location, instead of "zero" being the first point in the displayed interval, so looking at five days of normal data would show five distinct "waves" of increasing daily aggregate energy usage, with the wave Y value going back down to zero at 12:00:01 on each day. But I'll take whatever I can get.

Nested functions have only partial support. However, you can effectively nest functions by chaining Continuous Queries.
Use a CQ to calculate the derivative(mean(value)) and store that in a new measurement foo. Then for your graph you can query select sum(value) from foo.

(I know this answer is quite late, but it might help others. Oh, and please excuse all the Dutch in my graphs; I had to keep it in Dutch for the highest possible WAF.)
You could do what I do for my kWh calculations:
Which results in a simple query like this:
SELECT distinct("kwh_combined") FROM "smartmeter" WHERE $timeFilter GROUP BY time($__interval) fill(linear)
That gets you your total count. Or, if you want it in a nice graph like this, the bars show the number of kWh used per hour and the yellow line (I normally run in dark mode, excuse the yellow) shows my current power draw in watts:
This data (or at least your hourly usage in bars) can be retrieved by a query like this:
Which is this exact query (for B):
SELECT spread("kwh_combined") FROM "smartmeter" WHERE $timeFilter GROUP BY time(1h) fill(null)
... where the 'kwh_combined' is (still) my counter just counting up and up.
All this lets me query InfluxDB for a certain time period, like "last 24 hours", and come up with a nice panel like this: (ignore the encircled prices, those were for a question I posted 10 minutes ago, check my PS)
I hope this helps you or anyone else; it took me some figuring out, but I'm happy to give something back to the community :)
PS: Don't be as stupid as I was and hardcode your electrical and gas prices into your dashboard but store them with your measurements as they could change over time.
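
For reference, the spread() query above takes max minus min of the counter inside each hourly bucket. A minimal Python sketch of that calculation, with invented sample readings, might look like this:

```python
from collections import defaultdict
from datetime import datetime


def hourly_spread(samples):
    """Group (timestamp, counter) samples into hourly buckets and return
    max - min of the counter per bucket, keyed by the start of the hour."""
    buckets = defaultdict(list)
    for ts, value in samples:
        hour_start = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour_start].append(value)
    return {hour: max(vals) - min(vals) for hour, vals in sorted(buckets.items())}


# Hypothetical readings from an ever-increasing kWh counter.
samples = [
    (datetime(2016, 3, 10, 10, 5), 100.0),
    (datetime(2016, 3, 10, 10, 55), 100.5),
    (datetime(2016, 3, 10, 11, 10), 100.75),
    (datetime(2016, 3, 10, 11, 50), 102.0),
]
usage_per_hour = hourly_spread(samples)
```

One thing to keep in mind: whatever is consumed between the last sample of one bucket and the first sample of the next is not counted in either bucket, so frequent samples give a better approximation.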

I had the same problem (same application even) and solved it here. In your case, the query should be roughly:
SELECT value-value_fill FROM
(SELECT first(value) as value_fill FROM WattmeterMainskwh WHERE time>now()-7d GROUP BY time(1d)),
(SELECT first(value) as value FROM WattmeterMainskwh WHERE time>now()-7d GROUP BY time(1h))
fill(previous)
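
The idea behind that query, subtracting the first reading of each day from every later reading of that day, can be sketched in plain Python as follows. The function name and the sample readings are made up for illustration, and the first sample of the day stands in for the true midnight value.

```python
from datetime import datetime


def usage_since_day_start(samples):
    """samples: (timestamp, counter) pairs sorted by time.
    Returns (timestamp, counter - first counter seen that calendar day)."""
    baseline = {}  # date -> first counter value of that day
    rebased = []
    for ts, value in samples:
        start = baseline.setdefault(ts.date(), value)
        rebased.append((ts, value - start))
    return rebased


# Hypothetical readings spanning two days.
readings = [
    (datetime(2016, 3, 9, 0, 1), 100.0),
    (datetime(2016, 3, 9, 12, 0), 101.5),
    (datetime(2016, 3, 9, 23, 59), 103.0),
    (datetime(2016, 3, 10, 0, 1), 103.25),
    (datetime(2016, 3, 10, 6, 0), 104.0),
]
daily = usage_since_day_start(readings)
```

The rebased series climbs through the day and drops back to zero at the first sample after midnight, which gives the "waves" described in the bonus part of the question. Usage between the last sample before midnight and the first one after it is lost with this approach.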

Joining Two Data-sets in PowerPivot by Month

I've got 2 different data sets, revenue and contracts sold, that I need to join based off of year and month in PowerPivot so when I use my slicers, they'll filter accordingly. I know part of this will involve coming up with some temp tables for year and month but I can't get those to work. In the contracts sold table, there is an actual date column which I'm then using to format the year/month in "MM-MMM" format:
However, the revenue comes in only as a YYYYMM format:
So the solution would have to take this aspect into account as well. It's been a while since I've dealt with PowerPivot, and I recall PowerPivotPro or Kasper de Jonge's site containing something about linking tables based on a common month, but I can't find those pages anymore. If anyone could point me in the right direction or give me some insight, it'd be greatly appreciated.
I'm using Excel 2010 with PowerPivot version 11.0.3000.0.
Thanks,
Joshua

Joshua, I think the solution can be quite simple:
In the contracts sold table, create a new calculated column (a new column within the PowerPivot window) that gives you the same date format as in the revenue table (YYYYMM).
Use the Create Time Dimension app in Excel 2013. This app creates a date table with unique dates, which makes everything much easier. As with the other table, create a new calculated column with the same format (YYYYMM).
Create relationships between those tables: the date table will be linked to revenue as well as contracts.
Create the required measures (sums of revenue, number of contracts, etc.).
Place a new pivot table. The rows will probably be date-based (YYYYMM); with measures coming from both tables, it should be easy to create the report you need.
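
The key step above is deriving a YYYYMM value from a real date so that both tables share one month key. Here is a small Python sketch of that logic with invented sample data; in PowerPivot the same arithmetic (year times 100 plus month) would go into the calculated column, and the key's data type has to match on both sides of the relationship.

```python
from collections import Counter
from datetime import date


def yyyymm(d: date) -> int:
    """Month key in the same YYYYMM shape as the revenue data,
    e.g. 2013-03-15 -> 201303."""
    return d.year * 100 + d.month


# Hypothetical data: contract sale dates, and revenue already keyed by YYYYMM.
contract_dates = [date(2013, 3, 15), date(2013, 3, 28), date(2013, 4, 2)]
revenue_by_month = {201303: 1200.0, 201304: 800.0}

contracts_by_month = Counter(yyyymm(d) for d in contract_dates)

# Combine both data sets on the shared month key.
report = {
    key: (contracts_by_month.get(key, 0), revenue_by_month.get(key, 0.0))
    for key in sorted(set(contracts_by_month) | set(revenue_by_month))
}
```

Each entry in `report` pairs the number of contracts with the revenue for the same month, which is what the pivot table shows once the relationships are in place.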

Data warehouse reporting questions

I've just begun diving into data warehousing and I have one question that I just can't seem to figure out.
I have a business which has ten stores, each with a certain number of employees. In my data warehouse I have a dimension representing the store. The employee dimension is an SCD, with columns for start/end dates and the store at which the employee is working.
My fact table is based on suggestions the employees give (anonymously) to the store managers. This table contains the suggestion type (cleanliness, salary issue, etc), the date it was submitted (foreign keyed to a Time dimension table), and the store at which it was submitted.
What I want to do is create a report showing the ratio of the number of suggestions to the number of employees in a given year. Because the number of employees changes periodically I just can't do a simple query for the total number of employees.
Unfortunately I've searched the web quite a bit trying to find a solution but the majority of the examples are retail based sales, which is different from what I'm trying to do.
Any help would be appreciated. I do have the AdventureWorksDW installed on my machine so I can use that as a point of reference if anyone offers a suggestion using that.
Thanks in advance!

The slowly changing dimension should have a natural key that identifies the source of the row (otherwise how would it know what to compare to detect changes). This should be constant amongst all iterations of the dimension. You can get a count of employees by computing a distinct count of the natural key.
Edit: If your transaction table (suggestion) has a date on it, a distinct count of employees grouped by a computed function of the suggestion date (e.g. datepart (yy, s.SuggestionDate)) and the business unit should do it. You don't need to worry about the date on the employee dimension as the applicable row should join directly to the transaction table.

Add another fact table for the number of employees in each store for each month; you could use the maximum number for the month. Then average the months over the year and use that as the "number of employees in a year".
Load your new fact table at the end of each month. The new table would look like:
fact table: EmployeeCount
KeyEmployeeCount int -- surrogate key
KeyDate int -- FK to date dimension, points to the last day of a month
KeyStore int -- FK to store dimension
NumberOfEmployees int -- (max) number of employees for the month in a given store
If you need a finer resolution, use "per week" or even "per day". The main idea is to average the NumberOfEmployees measure for a given store over the year.
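
To make the arithmetic concrete, here is a minimal Python sketch of the ratio for one store and one year, assuming you have already pulled the twelve monthly headcounts and the yearly suggestion count out of the warehouse. The function name and the numbers are invented for illustration.

```python
from statistics import mean


def suggestions_per_employee(monthly_headcount, suggestion_count):
    """Ratio of suggestions to the average monthly headcount for one store-year."""
    average_headcount = mean(monthly_headcount)
    return suggestion_count / average_headcount


# Hypothetical store: 10 employees for half the year, 14 for the other half.
headcount_by_month = [10] * 6 + [14] * 6
ratio = suggestions_per_employee(headcount_by_month, suggestion_count=36)
```

With these numbers the average headcount is 12, so 36 suggestions work out to 3 per employee. Using the average rather than a year-end snapshot keeps the ratio stable when staffing changes during the year.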
