Clean cumulative sum alongside grouped sum - psql

I am working in PostgreSQL 9.6.6.
For the sake of reproducibility, I'll use create temporary table to create a "constant" table to play with:
create temporary table test_table as
select * from
(values
('2018-01-01', 2),
('2018-01-01', 3),
('2018-02-01', 1),
('2018-02-01', 2))
as t (month, count)
A select * from test_table returns the following:
month | count
------------+-------
2018-01-01 | 2
2018-01-01 | 3
2018-02-01 | 1
2018-02-01 | 2
The desired output is the following:
month | sum | cumulative_sum
------------+-----+----------------
2018-01-01 | 5 | 5
2018-02-01 | 3 | 8
In other words, the values have been summed, grouping by month, and then the cumulative sum is displayed in another column.
The issue is that the only way I know to achieve this is somewhat convoluted. The grouped sum must be computed first (with a subselect or a with statement), and then the running tally is computed with a select statement against that result, like so:
with sums as
(select month,
sum(count) as sum
from test_table
group by 1)
select month,
sum,
sum(sum) over (order by month) as cumulative_sum
from sums
What I wish could work would be something more like...
select month,
sum(count) as sum,
sum(count) over (order by month) as cumulative_sum
from test_table
group by 1
But this returns
ERROR: column "test_table.count" must appear in the GROUP BY clause or be used in an aggregate function
LINE 3: sum(count) over (order by month) as cumulative_sum
No amount of fussing with the group by clause seems to satisfy PostgreSQL.
TL;DR: is there a way in PostgreSQL to compute both a sum over groups and the cumulative sum over groups using just a single select statement? More generally, is there a "preferred" way to accomplish this, beyond the method I use in this question?

Your hunch to use SUM as an analytic function was on the right track, but you need to take the analytic sum of the aggregate sum:
SELECT month,
SUM(count) as sum,
SUM(SUM(count)) OVER (ORDER BY month) AS cumulative_sum
FROM test_table
GROUP BY 1;
As to why this works: analytic (window) functions are evaluated after the GROUP BY clause has been applied, so the aggregate sum is already available when the rolling sum is taken.
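To make that evaluation order concrete, here is a minimal Python sketch (not PostgreSQL itself) that mimics the two stages on the question's sample rows: first the grouping and aggregate, then the running total over the already-grouped rows. The variable names are illustrative only.

```python
from itertools import accumulate, groupby

# Same rows as test_table in the question: (month, count).
rows = [
    ("2018-01-01", 2),
    ("2018-01-01", 3),
    ("2018-02-01", 1),
    ("2018-02-01", 2),
]

# Stage 1: GROUP BY month with SUM(count); one output row per month.
rows.sort(key=lambda r: r[0])
grouped = [
    (month, sum(count for _, count in group))
    for month, group in groupby(rows, key=lambda r: r[0])
]

# Stage 2: the window function sees the grouped rows, ordered by month,
# and accumulates the per-month sums.
running = list(accumulate(total for _, total in grouped))

result = [(month, total, cum) for (month, total), cum in zip(grouped, running)]
for row in result:
    print(row)
```

The second stage never touches the raw count values, which is why the window function has to wrap the aggregate rather than the bare column.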

Related

How to calculate a sum conditionally based on the values of two other columns

This feels like it may be basic, but I cannot find a way to do this with sumifs.
I've got four columns in one table, representing the workload of an employee like this:
Job   | Employee # 1 | Employee # 2 | Workload
------+--------------+--------------+----------
Job 1 | Bob          | Jane         | 5
Job 2 | Bob          |              | 2
Job 3 | Jane         | Susan        | 3
Job 4 | Susan        |              | 2
I'd like to output total workload results to a second sheet for each employee based on a specialized formula. In English, the formula would be:
Calculate the total workload for each employee.
- For each job that includes the employee named "X" and no assigned teammate, use the job's corresponding workload value.
- For each job that includes the employee named "X" and has an assigned teammate, reduce the corresponding workload value by 50%.
So with the given table above, I'd want an output like this:
Employee Name | Workload
--------------+----------
Bob           | 4.5
Jane          | 4
Susan         | 3.5
Math:
Bob = ((job_1 / 2) + job_2)
Jane = ((job_1 / 2) + (job_3 / 2))
Susan = ((job_3 / 2) + job_4)
Does anyone know how I can accomplish this?
Functions like sumifs seem to only let me set criteria to sum or not sum a value. But I cannot find a clear way to sum only 50% of a value based on a condition in a separate column.
=LAMBDA(name,SUMIFS(D2:D5,B2:B5,name,C2:C5,"")+SUMIFS(D2:D5,B2:B5,name,C2:C5,"<>")/2+SUMIFS(D2:D5,C2:C5,name)/2)("Bob")
The logic is:
- sum D where B is Bob and C is empty, plus
- half the sum of D where B is Bob and C is not empty, plus
- half the sum of D where C is Bob.
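As a cross-check of those three terms, here is a small Python sketch that applies the same rules to the question's table. The `jobs` list and the `workload` helper are illustrative names, not part of any Sheets API, and `None` stands in for an empty teammate cell.

```python
# (job, employee_1, employee_2, workload); None marks an empty cell.
jobs = [
    ("Job 1", "Bob", "Jane", 5),
    ("Job 2", "Bob", None, 2),
    ("Job 3", "Jane", "Susan", 3),
    ("Job 4", "Susan", None, 2),
]

def workload(name, jobs):
    """Mirror the three SUMIFS terms for a single employee name."""
    total = 0.0
    for _job, first, second, load in jobs:
        if first == name and not second:
            total += load        # first column matches, no teammate: full value
        elif first == name and second:
            total += load / 2    # first column matches, teammate present: half
        if second == name:
            total += load / 2    # second column matches: half
    return total

for person in ("Bob", "Jane", "Susan"):
    print(person, workload(person, jobs))
```

With the sample data this gives 4.5 for Bob, 4.0 for Jane and 3.5 for Susan, matching the "Math" lines in the question.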
Or the same logic via query:
=QUERY(
{
QUERY(B1:D5,"Select B,sum(D) where C is null group by B");
QUERY(B1:D5,"Select C,sum(D)/2 where C is not null group by C");
QUERY(B1:D5,"Select B,sum(D)/2 where C is not null group by B")
},
"Select Col1, sum(Col2) where not Col1 contains 'Employee' group by Col1"
)
Note, however, that this assumes the header row contains the word Employee and that no employee name contains it.
Employee # 1 | sum sum Workload
-------------+------------------
Bob          | 4.5
Jane         | 4 (Incorrect in the question)
Susan        | 3.5
Use this formula
=ArrayFormula({ "Employee Name", "Workload";
QUERY(
QUERY({LAMBDA(a,b,c,d,k, {b,a,{d/k};c,a,{d/k}} )
(A2:A,B2:B,C2:C,D2:D,
BYROW(B2:C, LAMBDA(c, IF(COUNTA(c)=0,,IF(COUNTA(c)=1,1,2)))))},
"Select (Col1),sum(Col3) Group by Col1" ,0), "Where Col1 <> '' ",0)})
Help for the functions used:
ARRAYFORMULA - QUERY - LAMBDA - BYROW - IF - COUNTA - SUM

How to Query in google sheets, sort by a column and not include that column in the output

I would like to query in Google Sheets and sort the result by a specific column in ascending order, with a secondary sort that is also ascending. I already know how to do this with
=QUERY(A:C,"select * where month(A)+1 = 1 order by A,B ",0)
Here I queried 3 columns: month, unique ID, and name. I selected the rows with the necessary month and sorted them by month, followed by a secondary sort on unique ID. But this query outputs 3 columns. How would I change the formula so the output does not include the month column anymore?
Your question is:
How would I change the formula so the output does not include the month column anymore?
Try wrapping your QUERY formula within another QUERY. Like:
=QUERY(QUERY(your_query_here),"select Col2, Col3")
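The nested-QUERY pattern is "filter and sort first, project afterwards". A rough Python equivalent, using made-up rows of (date, unique ID, name), might look like this:

```python
from datetime import date

# Hypothetical rows: (date, unique ID, name).
rows = [
    (date(2023, 1, 20), 7, "Carol"),
    (date(2023, 2, 3), 1, "Dave"),
    (date(2023, 1, 5), 9, "Alice"),
    (date(2023, 1, 5), 2, "Bob"),
]

# Inner query: keep January rows, order by date and then by ID.
january = sorted(
    (r for r in rows if r[0].month == 1),
    key=lambda r: (r[0], r[1]),
)

# Outer query: drop the date column once the ordering is fixed.
result = [(uid, name) for _date, uid, name in january]
print(result)
```

The ordering survives the projection because the outer step only removes a column and does not reorder rows.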
For your given example it would be:
=QUERY(QUERY(A1:C22,"select * where month(A)+1 = 1 order by A,B ",0),
"select Col2, Col3")

Google sheets binning and group by (custom time interval)

I wish to count events that occurred within a custom time interval: it could be within 24 hours, within a week, or within a 2-month span.
I am using Google Sheets: I can create a pivot table and group by month, but I'd like to explore insights using custom intervals (I'm looking for patterns in epilepsy).
As a final result, I want a table that reports, for each day, the number of events within that interval.
In particular, I want to focus on the 24-hour interval to count the number of epilepsy events (known as cluster seizures),
and then on custom day intervals to explore periodicity or trends, such as every 48 hours or every 15 or 30 days.
See a mockup of the Google Sheet here:
https://docs.google.com/spreadsheets/d/1tCxYV5mUcq6vKm8-fL-0HUAOjcB9fipLCqPD2Znv-X0/edit#gid=1372548551
I tried these approaches.
The first finds out how many events occurred in the 30 days prior to the reported date:
= IFERROR(
QUERY(
A:E,
"SELECT COUNT(A)
WHERE
A IS NOT NULL AND
E = FALSE AND
A >= date '" &
TEXT(
A2-30,
"yyyy-MM-dd"
) &"' AND
A <= date '" &
TEXT(
A2,
"yyyy-MM-dd"
) &"'
LABEL COUNT(A) '' "), "N/A")
Then, dragging the cell, I get the column "# events in the prior 30 days".
It works but seems a bit messy - especially for updating the intervals.
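For reference, the per-row "events in the prior N days" count can be sketched in Python as below. The event dates are made up (the linked sheet's data is not reproduced here), and the extra E = FALSE filter from the formula is left out for brevity.

```python
from datetime import date, timedelta

# Hypothetical event dates, one per row of column A.
events = [
    date(2021, 1, 1),
    date(2021, 1, 10),
    date(2021, 2, 5),
    date(2021, 3, 20),
]

def count_in_prior_window(events, end, days=30):
    """Count events with end - days <= event date <= end (inclusive)."""
    start = end - timedelta(days=days)
    return sum(1 for d in events if start <= d <= end)

# One count per row, like dragging the formula down the column.
counts = [count_in_prior_window(events, d) for d in events]
print(counts)
```

Changing the interval then means changing a single `days` argument rather than editing every dragged cell.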
I tried this other approach:
=query(B:E, "select B, count(E), -1+count(E) where E = FALSE group by B label B 'Date with Clusters', count(E) 'Cluster seizures '")
That produces the last table.
I like this approach better, but here I am just grouping by the same date, with no possibility of a custom interval.
For example, two events will be counted together only if they fall on the same day, not if they fall within the same 24-hour interval.
Could you suggest a better approach to handling datetime differences, so that I can bin and group by custom intervals?
Below an example:
on the left table, data in input; on the middle column, result of first approach; on the right table, results of second approach.
Given the table, in order to group rows with QUERY we first need to "fix" column A so that each date maps to a custom period. Let's say we need to group events every 3 weeks (21 days). We take the lowest and highest dates and create a sequence of all the dates in between:
=INDEX(ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A))))
Then we use a running total on it to get every date that is 21 days apart from the previous/next one. We could use a simple SEQUENCE (for min>max) to create this array, but with SEQUENCE we can't go "back in time" (for max>min), so we use MMULT and a negative number.
Therefore, to start from the first date and create 3-week group-by windows (e.g. min>max), we use:
=ARRAYFORMULA({MIN(A2:A); MIN(A2:A)+MMULT(TRANSPOSE((
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))<=TRANSPOSE(
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))))*21); SIGN(
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))))})
And to get the reverse, starting from the end date and creating 3-week windows backwards (e.g. max>min), we use:
=ARRAYFORMULA({MAX(A2:A); MAX(A2:A)+MMULT(TRANSPOSE((
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))<=TRANSPOSE(
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))))*-21); SIGN(
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))))})
At this stage, we can start fixing column A via VLOOKUP with the 4th argument set to 1 (approximate match) instead of 0 (exact match), so forward in time it will be:
=ARRAYFORMULA(IFNA(VLOOKUP(A2:A; SORT({MIN(A2:A); MIN(A2:A)+MMULT(TRANSPOSE((
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))<=TRANSPOSE(
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))))*21); SIGN(
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))))}); 1; 1)))
And backward in time it will be:
=ARRAYFORMULA(IFNA(VLOOKUP(A2:A; SORT({MAX(A2:A); MAX(A2:A)+MMULT(TRANSPOSE((
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))<=TRANSPOSE(
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))))*-21); SIGN(
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))))}); 1; 1)))
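The approximate-match VLOOKUP step amounts to "find the largest window start that is not after the date". A minimal Python sketch of the forward case, using the standard `bisect` module and made-up dates, could look like this (the helper names are illustrative):

```python
from bisect import bisect_right
from datetime import date, timedelta

def forward_edges(first, last, step_days=21):
    """Window start dates: first, first + 21 days, ... up to last."""
    edges = [first]
    while edges[-1] + timedelta(days=step_days) <= last:
        edges.append(edges[-1] + timedelta(days=step_days))
    return edges

def bin_start(d, edges):
    """Approximate-match lookup: the largest edge that is <= d."""
    return edges[bisect_right(edges, d) - 1]

# Hypothetical column A values.
dates = [date(2021, 1, 1), date(2021, 1, 15), date(2021, 1, 22), date(2021, 2, 20)]

edges = forward_edges(min(dates), max(dates))
binned = [bin_start(d, edges) for d in dates]
print(binned)
```

Like VLOOKUP in approximate mode, `bisect_right` relies on the edges being sorted in ascending order, which is why the Sheets formulas wrap the generated array in SORT.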
And now we just create a virtual array {} that pairs the fixed column A with column C, and pass it as the range to QUERY.
Side note:
to put columns next to each other in English-locale spreadsheets we use ,
to put columns next to each other in non-English-locale spreadsheets we use \
=ARRAYFORMULA(QUERY({IFNA(VLOOKUP(A2:A; SORT({MIN(A2:A); MIN(A2:A)+MMULT(TRANSPOSE((
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))<=TRANSPOSE(
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))))*21); SIGN(
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))))}); 1; 1))\ C2:C};
"select Col1,count(Col1)
where Col2 = FALSE
group by Col1
order by count(Col1) desc
label count(Col1)''"))
And backwards in time:
=ARRAYFORMULA(QUERY({IFNA(VLOOKUP(A2:A; SORT({MAX(A2:A); MAX(A2:A)+MMULT(TRANSPOSE((
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))<=TRANSPOSE(
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))))*-21); SIGN(
ROW(INDIRECT(MIN(A2:A)&":"&MAX(A2:A)))))}); 1; 1))\ C2:C};
"select Col1,count(Col1)
where Col2 = FALSE
group by Col1
order by count(Col1) desc
label count(Col1)''"))
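The final group-and-count step can be sketched in Python as well. The example below handles the backward case: windows are anchored at the latest date and step back 21 days at a time, and only rows whose flag is False are counted, mirroring "where Col2 = FALSE". The rows and the `backward_bin` helper are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical (date, flag) rows standing in for columns A and C.
rows = [
    (date(2021, 1, 1), False),
    (date(2021, 1, 15), False),
    (date(2021, 1, 22), True),
    (date(2021, 2, 10), False),
    (date(2021, 2, 19), False),
    (date(2021, 2, 20), False),
]

def backward_bin(d, last, step_days=21):
    """Largest date of the form last - k * step_days that is <= d."""
    steps = -(-(last - d).days // step_days)  # ceiling division
    return last - timedelta(days=steps * step_days)

last = max(d for d, _flag in rows)
counts = Counter(backward_bin(d, last) for d, flag in rows if flag is False)

# Equivalent of "order by count(Col1) desc".
ranked = counts.most_common()
print(ranked)
```

Changing the window length only requires a different `step_days` value.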

Merging two data sets in order to add default values for missing data

I'm trying to merge two datasets in order to insert default rows for missing data. The use case is that I have a list of dates and attendance numbers for training sessions on those dates, but if I have no records at all for a training session then it's missing from the list.
In my sheet at the moment I have a two column set of dates and attendance numbers, and in another sheet I have worked out all the Wednesdays and Fridays (training days) between the start and end dates of all the sessions we have data for.
Is there a way to merge the two datasets together so that the zero attendance for each session is the base set and then I merge in the rows for which I have data? I've tried using some of the query command but if I specify two datasets using {Sheet1!A1:A,Sheet2!B1:B} I get array errors.
The attendance information is currently gathered with a query like this:
=QUERY({Records!A2:B}, "SELECT Col1, COUNT(Col2) WHERE (Col1 IS NOT NULL) GROUP BY Col1 ORDER BY Col1 ASC LABEL Col1 'Session Date', COUNT(Col2) 'Skaters'")
where the Records sheet is just dates and names.
If I update it to read from two datasets:
=QUERY({Records!A2:B, Scratch!B2:B}, "SELECT Col1, COUNT(Col2) WHERE (Col1 IS NOT NULL) GROUP BY Col1 ORDER BY Col1 ASC LABEL Col1 'Session Date', COUNT(Col2) 'Skaters'")
then I get a REF error: "Function ARRAY_ROW parameter 2 has mismatched row size. Expected: 982. Actual: 999." That seems fair, as it has created a misaligned dataset rather than merging on the date column.
I'm probably treating the spreadsheet a bit too much like a database, and while I would be more comfortable dropping into the script editor to resolve this I'm trying to learn a few spreadsheet techniques.
Data
Records looks like this:
| 2018-05-04 | Bob |
| 2018-05-04 | Fred |
| 2018-05-12 | Bob |
So no-one took attendance on the 9th, and so the stats are skewed as Bob gets a misleading 100% attendance record.
I do not understand the details of what you are trying to do, but since it seems to involve combining one list of just dates and at least two lists of dates and names, I offer the following example.
The formula is:
=ArrayFormula(query({Sheet1!B1:C20;Sheet2!E1:F20;Sheet3!I1:J20},"select * where Col2 is not NULL order by Col1 "))
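For comparison, the "zero by default, then merge in real counts" idea from the question can be sketched in Python as follows. The `records` rows come from the question; the `sessions` list is an assumed stand-in for the separately computed training dates.

```python
from collections import Counter
from datetime import date

# Attendance records from the question: (session date, skater).
records = [
    (date(2018, 5, 4), "Bob"),
    (date(2018, 5, 4), "Fred"),
    (date(2018, 5, 12), "Bob"),
]

# Assumed list of all scheduled sessions, including one with no records.
sessions = [date(2018, 5, 4), date(2018, 5, 9), date(2018, 5, 12)]

# Base set: every session starts at zero attendance.
attendance = {d: 0 for d in sessions}

# Merge in the sessions for which records exist.
attendance.update(Counter(d for d, _name in records))

merged = sorted(attendance.items())
for session_date, skaters in merged:
    print(session_date, skaters)
```

The key point is that the two datasets are combined by date (stacked as rows), not placed side by side as columns, which is what triggers the mismatched row size error.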

how to count elements in google spreadsheets matching a certain date range

I have some data in columns with a timestamp in the first column and data columns.
A B C D
+++++++++++++++++++++++++++++++++++
20.5.2011 1 2 5
18.5.2011 3 5 4
12.5.2013 4 7 5
I am able to filter column data based on the timestamp with the Google Sheets formula below, which returns the sum of all integers in column B that have a corresponding 2011 timestamp.
=ArrayFormula(SUMIF(TEXT($A:$A;"yyyy");year(today())-1;$B:$B))
The above sums the values 1 and 3 from column B and returns 4.
The question is, how would I calculate the average of the above values 1 and 3, resulting in 2? My current approach is to divide the above formula by the count() of items that match the date criterion, but I cannot get it to work.
=ArrayFormula(SUMIF(TEXT($A:$A;"yyyy");year(today())-1;$B:$B))/FORMULA FOR THE DIVISOR
Any ideas?
You can use COUNTIF in much the same way as you used SUMIF:
=ArrayFormula(SUMIF(TEXT($A:$A;"yyyy");YEAR(TODAY())-1;$B:$B)/COUNTIF(TEXT($A:$A;"yyyy");YEAR(TODAY())-1))
(this would currently return the average of all the 2012 entries).
You can simplify this a little by using the YEAR function in the comparison array:
=ArrayFormula(SUMIF(YEAR($A:$A);YEAR(TODAY())-1;$B:$B)/COUNTIF(YEAR($A:$A);YEAR(TODAY())-1))
You can also generate a table of sums, averages or counts quite easily with QUERY:
=QUERY(A:B;"select year(A), avg(B) where A is not null group by year(A) label year(A) 'Year', avg(B) 'Average'")
and if you just wanted the average for 2012 as a single value:
=QUERY(A:B;"select avg(B) where year(A) = "&(YEAR(TODAY())-1)&" group by year(A) label avg(B) ''")
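The SUMIF/COUNTIF ratio is simply the mean of the values whose year matches. A short Python sketch using the question's sample rows (the helper name is illustrative) shows the same computation:

```python
from datetime import date
from statistics import mean

# Sample rows from the question: (timestamp in column A, value in column B).
rows = [
    (date(2011, 5, 20), 1),
    (date(2011, 5, 18), 3),
    (date(2013, 5, 12), 4),
]

def average_for_year(rows, year):
    """Mean of column B over rows whose date falls in the given year."""
    values = [value for d, value in rows if d.year == year]
    return mean(values) if values else None  # None when nothing matches

print(average_for_year(rows, 2011))
```

Returning `None` for a year with no rows avoids the division-by-zero case that the plain SUMIF/COUNTIF formula would hit.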
