Query only records with max value within a group - ruby-on-rails

Say you have the following users table on PostgreSQL:
id | group_id | name | age
---|----------|---------|----
1 | 1 | adam | 10
2 | 1 | ben | 11
3 | 1 | charlie | 12 <-
3 | 2 | donnie | 20
4 | 2 | ewan | 21 <-
5 | 3 | fred | 30 <-
How can I query all columns only from the oldest user per group_id (those marked with an arrow)?
I've tried with group by, but keep hitting "users.id" must appear in the GROUP BY clause.
(Note: I have to work the query into a Rails AR model scope.)

After some digging, I found that you can use PostgreSQL's DISTINCT ON (col):
select distinct on (users.group_id) users.*
from users
order by users.group_id, users.age desc;
-- you might want to add extra column in ordering in case 2 users have the same age for same group_id
Translated in Rails, it would be:
User
.select('DISTINCT ON (users.group_id) users.*')
.order('users.group_id, users.age DESC')
Some doc about DISTINCT ON: https://www.postgresql.org/docs/9.3/sql-select.html#SQL-DISTINCT
Working example: https://www.db-fiddle.com/f/t4jeW4Sy91oxEfjMKYJpB1/0

You could use the ROW_NUMBER window function (or RANK, if ties are possible and all tied rows should be returned):
SELECT *
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY group_id ORDER BY age DESC) AS rn
FROM users) s
WHERE s.rn = 1;

You can join against a subquery that holds the aggregated result:
select m.*
from users m
inner join (
select group_id, max(age) max_age
from users
group by group_id
) AS t on (t.group_id = m.group_id and t.max_age = m.age)
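The join-on-aggregate approach above can be tried without a PostgreSQL server. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in database, and the ids are made unique here for illustration (the table in the question repeats id 3).

```python
import sqlite3

# In-memory stand-in for the users table from the question (ids made unique here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, group_id INTEGER, name TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, 1, "adam", 10), (2, 1, "ben", 11), (3, 1, "charlie", 12),
        (4, 2, "donnie", 20), (5, 2, "ewan", 21), (6, 3, "fred", 30),
    ],
)

# Join each user to the maximum age of their group; only the oldest rows match.
oldest_per_group = conn.execute(
    """
    SELECT m.*
    FROM users m
    INNER JOIN (
        SELECT group_id, MAX(age) AS max_age
        FROM users
        GROUP BY group_id
    ) AS t ON t.group_id = m.group_id AND t.max_age = m.age
    ORDER BY m.group_id
    """
).fetchall()

print([row[2] for row in oldest_per_group])  # ['charlie', 'ewan', 'fred']
```

Note that if two users in a group share the maximum age, this query returns both of them.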

Related

LibreOffice HSQLDB WHERE clause with LEFT JOIN and MAX?

I'm running macOS 11.6, LibreOffice 7.2.2.2, and HSQLDB (my understanding is that this is v1.8, but I don't know how to verify it).
I'm a newbie to SQL, and I'm trying to write a DB to maintain a club membership roster. I'm trying to find everyone in the DB to whom renewal letters should be sent. The quirk is, if a person has never paid in the past, they should be sent a renewal letter. Old members who haven't renewed recently don't get a renewal, and obviously, each individual should only get one letter. I've created a toy example to display the problem I'm having...
Members table:
Key (Integer, Primary key, Autoincrement)
Name (Varchar)
+-----+----------+
| Key | Name |
+-----+----------+
| 0 | Abby |
| 1 | Bob |
| 2 | Dave |
| 3 | Ellen |
+-----+----------+
Payments table:
Key (Integer, Primary Key, autoincrement)
MemberKey (Integer, foreign key to Member table)
Payment Date (Date)
+-----+-----------+--------------+
| Key | MemberKey | Payment Date |
+-----+-----------+--------------+
| 0 | 0 | 2020-05-23 |
| 1 | 0 | 2021-06-12 |
| 2 | 1 | 2016-05-28 |
| 3 | 2 | 2020-07-02 |
+-----+-----------+--------------+
The only way I've found to include everyone is with a LEFT JOIN. The only way I've found to pick the most recent payment is with MAX. The following query produces a list of everyone's most recent payments, including people who've never paid:
SELECT "Members"."Key", "Members"."Name", MAX( "Payments"."Payment Date" ) AS "Last Payment"
FROM { oj "Members" LEFT OUTER JOIN "Payments" ON "Members"."Key" = "Payments"."MemberKey" }
GROUP BY "Members"."Key", "Members"."Name"
It returns the result below, which includes all members only once (Abby has 2 payments but only appears once with the most recent payment). Unfortunately it still includes people like Bob who've been out of the club so long that we don't want to send them a renewal notice.
+-----+----------+--------------+
| Key | Name | Last Payment |
+-----+----------+--------------+
| 0 | Abby | 2021-06-12 |
| 1 | Bob | 2016-05-28 |
| 2 | Dave | 2020-07-02 |
| 3 | Ellen | |
+-----+----------+--------------+
Where I hit a wall is when I try to perform any kind of conditional operation on the Last Payment, to determine whether it's recent enough to include in the list of renewal notices. For instance, in HSQLDB, the query below returns the error, "The data content could not be loaded. Not a condition." The only change in this query from the 1st one is the addition of the WHERE clause.
SELECT "Members"."Key", "Members"."Name", MAX( "Payments"."Payment Date" ) AS "Last Payment"
FROM { oj "Members" LEFT OUTER JOIN "Payments" ON "Members"."Key" = "Payments"."MemberKey" }
WHERE "Last Payment" >= '2020-01-01'
GROUP BY "Members"."Key", "Members"."Name"
The desired output should look like this:
+-----+----------+--------------+
| Key | Name | Last Payment |
+-----+----------+--------------+
| 0 | Abby | 2021-06-12 |
| 2 | Dave | 2020-07-02 |
| 3 | Ellen | |
+-----+----------+--------------+
I've been digging around the web trying anything that looks relevant. I've tried "HAVING" clauses--I can make them work with a COUNT(*) function, but I can't make them work with a MAX(*) function. I've tried using my 1st query as a subquery, and applying the WHERE clause on "Last Payment" in the main query. I've tried solutions people say work in MySQL, but I can't get them to work in HSQLDB. I tried using the 1st query as a View, and writing a query against the View. I've tried a dozen other things I don't even remember. Everything past the 1st query above throws an error. I wanted to include my toy DB, but can't find a way to attach it to the post.
Can anyone help please?
This worked for me.
SELECT "Members"."Key", "Members"."Name", MAX( "Payments"."Payment Date" ) AS "Last Payment"
FROM {oj "Members" LEFT OUTER JOIN "Payments" ON "Members"."Key" = "Payments"."MemberKey"
WHERE "Payments"."Payment Date" >= '2020-01-01'
OR "Payments"."Payment Date" IS NULL}
GROUP BY "Members"."Key", "Members"."Name"
This works as well.
SELECT "Members"."Key", "Members"."Name", MAX( "Payments"."Payment Date" ) AS "Last Payment"
FROM { oj "Members" LEFT OUTER JOIN "Payments" ON "Members"."Key" = "Payments"."MemberKey" }
WHERE "Payments"."Payment Date" >= '2020-01-01'
OR "Payments"."Payment Date" IS NULL
GROUP BY "Members"."Key", "Members"."Name"
Perhaps the problem you were having is that "Last Payment" is only a column alias defined in the SELECT list and not the name of an actual column, so it cannot be referenced in the WHERE clause.
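Another way to express the same filter in general SQL is to test the aggregate itself in a HAVING clause instead of filtering the raw payment rows. The sketch below runs against SQLite through Python's sqlite3 module, not against HSQLDB 1.8, so whether the embedded LibreOffice engine accepts it is an assumption that would need checking.

```python
import sqlite3

# Toy Members/Payments data from the question, loaded into an in-memory SQLite DB.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE "Members" ("Key" INTEGER PRIMARY KEY, "Name" TEXT);
    CREATE TABLE "Payments" (
        "Key" INTEGER PRIMARY KEY, "MemberKey" INTEGER, "Payment Date" TEXT
    );
    INSERT INTO "Members" VALUES (0, 'Abby'), (1, 'Bob'), (2, 'Dave'), (3, 'Ellen');
    INSERT INTO "Payments" VALUES
        (0, 0, '2020-05-23'), (1, 0, '2021-06-12'),
        (2, 1, '2016-05-28'), (3, 2, '2020-07-02');
    """
)

# HAVING is evaluated after grouping, so it can test MAX(...) directly.
renewals = conn.execute(
    """
    SELECT m."Key", m."Name", MAX(p."Payment Date") AS "Last Payment"
    FROM "Members" m
    LEFT OUTER JOIN "Payments" p ON m."Key" = p."MemberKey"
    GROUP BY m."Key", m."Name"
    HAVING MAX(p."Payment Date") >= '2020-01-01'
        OR MAX(p."Payment Date") IS NULL
    ORDER BY m."Key"
    """
).fetchall()

print(renewals)
# [(0, 'Abby', '2021-06-12'), (2, 'Dave', '2020-07-02'), (3, 'Ellen', None)]
```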

How to join exclusively by date range in Hive SQL?

I have two subqueries that I'd like to join only by date range: rows from the second table whose open_date falls between the open and close datetimes of the first table.
First table example:
| id_original | open_datetime | close_datetime |
|-------------|-------------------|-------------------|
| 1 |2019-01-01 10:00:02|2019-01-02 11:00:21|
| 2 |2019-01-01 10:05:52|2019-01-05 16:45:12|
| 3 |2019-01-03 00:00:43|2019-01-03 23:12:44|
Second table example:
| category | all other columns...| open_date |
|----------|---------------------|-------------------|
| A | ... |2019-01-01 11:00:00|
| B | ... |2019-01-02 19:10:10|
| C | ... |2019-01-03 08:23:45|
| D | ... |2019-01-04 18:10:53|
Desired output:
| id_original | category | all other columns...| open_date |
|-------------|----------|---------------------|-------------------|
| 1 | A | ... |2019-01-01 11:00:00|
| 2 | A | ... |2019-01-01 11:00:00|
| 2 | B | ... |2019-01-02 19:10:10|
| 2 | C | ... |2019-01-03 08:23:45|
| 2 | D | ... |2019-01-04 18:10:53|
| 3 | C | ... |2019-01-03 08:23:45|
This is my code:
SELECT *
FROM (
SELECT id, open_datetime, close_datetime
FROM table1
WHERE id IN (list_of_ids)
) t1
LEFT JOIN (
SELECT *
FROM table2
WHERE other_conditions
) t2 ON t2.open_date >= t1.open_datetime AND t2.open_date <= t1.close_datetime
I know that Hive SQL doesn't support inequality as conditions for a JOIN. But how should I approach this matter?
Note: The join I need is exclusively for dates, there is no equal key from t1 and t2 that I can use to join them.
Thanks!
Move the join condition to the WHERE clause. Because there are no other join conditions, the LEFT JOIN becomes a CROSS JOIN (a join without conditions is a cross join), and the rows are then filtered in the WHERE clause. Be aware that a CROSS JOIN can cause serious performance problems when there is no other key or filter to cut down the Cartesian product. If one of the tables is small enough to fit into memory, the CROSS JOIN is executed as a map join, which also helps performance.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=512000000; --try to set it bigger and see if map-join works
--setting too big value may cause OOM exception
SELECT *
FROM (
SELECT id, open_datetime, close_datetime
FROM table1
WHERE id IN (list_of_ids)
) t1
CROSS JOIN
(
SELECT *
FROM table2
WHERE other_conditions
) t2
WHERE (t2.open_date >= t1.open_datetime AND t2.open_date <= t1.close_datetime)
OR t2.category is NULL --meant to mimic LEFT JOIN; note that t1 rows with no t2 row in range are still dropped after a CROSS JOIN
;
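The cross-join-then-filter idea can be checked on the sample data from the question. This is a sketch in SQLite via Python's sqlite3 module rather than Hive, so it says nothing about map-join behaviour or performance; it only shows that filtering the Cartesian product on the date range yields the desired pairs. The column list of table2 is reduced to the two columns that matter here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, open_datetime TEXT, close_datetime TEXT)")
conn.execute("CREATE TABLE table2 (category TEXT, open_date TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        (1, "2019-01-01 10:00:02", "2019-01-02 11:00:21"),
        (2, "2019-01-01 10:05:52", "2019-01-05 16:45:12"),
        (3, "2019-01-03 00:00:43", "2019-01-03 23:12:44"),
    ],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?)",
    [
        ("A", "2019-01-01 11:00:00"), ("B", "2019-01-02 19:10:10"),
        ("C", "2019-01-03 08:23:45"), ("D", "2019-01-04 18:10:53"),
    ],
)

# Build every (t1, t2) pair, then keep the pairs whose open_date is in range.
# ISO-formatted timestamps compare correctly as plain strings.
pairs = conn.execute(
    """
    SELECT t1.id, t2.category
    FROM table1 t1
    CROSS JOIN table2 t2
    WHERE t2.open_date >= t1.open_datetime
      AND t2.open_date <= t1.close_datetime
    ORDER BY t1.id, t2.category
    """
).fetchall()

print(pairs)  # [(1, 'A'), (2, 'A'), (2, 'B'), (2, 'C'), (2, 'D'), (3, 'C')]
```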

Complex Nested SUM / Sub-Select

UPDATED with sample data etc.
I am a bit over my head on this complex query. Some background: this is a Rails app, and I have an expenditures model which has many expenditure_items, each with an amount column. These amounts sum up to a total for the related expenditure.
A given expenditure can be an Order, which can then have multiple (or a single, or no) related Invoice expenditures. I am looking for a single query that gives me all the orders that have invoices, together with their invoice totals, and identifies those whose invoices total more than a threshold above the order (in my case 10%).
I get the idea from my searching that I need a sub-select here, but I can't sort it out. I apologize, as raw SQL is not my wheelhouse; normal Rails Active Record calls meet 99% of my needs.
Sample Data:
=> SELECT * FROM expenditures WHERE id = 17;
id | category | parent_id
-----+----------------+----------
17 | purchase_order |
=> SELECT * FROM expenditures_items WHERE expenditure_id = 17;
id | amount
-----+-------------
1 | 1000.00
2 | 2000.00
I need to obtain the SUM ( expenditures.amount ) in my result - the original order of $3,000.00.
Related Expenditures (invoices)
=> SELECT * FROM expenditures WHERE category = 'invoice' AND parent_id = 17;
id | category | parent_id
-----+----------------+----------
46 | invoice | 17
88 | invoice | 17
=> SELECT * FROM expenditures_items WHERE expenditure_id IN (46, 88) ;
id | amount | expenditure_id
-----+----------+---------------
23 | 500.00 | 46
24 | 1000.00 | 46
78 | 550.00 | 88
79 | 1100.00 | 88
Order 17 has two invoices (46 & 88) totalling $3,150.00 - this is the SUM of all the invoice expenditure_item amounts.
In the end I am looking for the SQL that gets me something like this:
=> SELECT * FROM expenditures WHERE category = 'purchase_order';
id | category | expenditure_total | invoice_total | percent
-----+----------------+-------------------+---------------+---------
17 | purchase_order | 3000.00 | 3150.00 | 5
45 | purchase_order | 4000.00 | 3000.00 | -25
75 | purchase_order | 7000.00 | 7000.00 | 0
99 | purchase_order | 10000.00 | 11100.00 | 11
percent is (invoice_total / expenditure_total - 1) * 100.
I also need to keep only the results whose percent exceeds a threshold (say 10), perhaps with a HAVING clause.
From all my searching this seems to be a sub query along with some joins but I am lost at this point.
UPDATED Further
I had another look - this is close:
SELECT DISTINCT expenditures.*, SUM( invoice_items.amount ) as invoiced_total
FROM "expenditures"
JOIN expenditures AS invoices ON invoices.category = 'invoice' AND expenditures.id = CAST( invoices.ancestry AS INT)
JOIN expenditure_items ON expenditure_items.expenditure_id = expenditures.id
JOIN expenditure_items AS invoice_items ON invoice_items.expenditure_id = invoices.id
WHERE "expenditures"."category" IN ($1, $2)
GROUP BY expenditures.id
HAVING (( SUM( invoice_items.amount ) / SUM( expenditure_items.amount ) ) > 1.1 )
[["category", "work_order"], ["category", "purchase_order"]]
Here is the odd thing - the invoiced_total in the select works. I get the proper amounts as per my example. The issue seems to be in my HAVING where it only pulls the SUM on the first invoice.
UPDATE 3
Soooooo close:
SELECT DISTINCT
expenditures.*,
( SELECT
SUM(expenditure_items.amount)
FROM expenditure_items
WHERE expenditure_items.expenditure_id = expenditures.id ) AS order_total,
( SELECT
SUM(expenditure_items.amount)
FROM expenditure_items
JOIN expenditures invoices ON expenditure_items.expenditure_id = invoices.id
AND CAST (invoices.ancestry AS INT) = expenditures.id ) AS invoice_total
FROM "expenditures"
INNER JOIN "expenditure_items" ON "expenditure_items"."expenditure_id" = "expenditures"."id"
WHERE "expenditures"."category" IN ('work_order', 'purchase_order')
The only thing I can't get is eliminate the expenditures that either have no invoices or that are over my 10% rule. The first was in my old solution with the original join - I can't seem to figure out how to sum on that join data.
step-by-step demo:db<>fiddle
I am sure, there is a better solution, but this one should work:
WITH cte AS (
SELECT
e.id,
e.category,
COALESCE(parent_id, e.id) AS parent_id,
ei.amount
FROM
expenditures e
JOIN
expenditures_items ei ON e.id = ei.expenditure_id
),
cte2 AS (
SELECT
id,
SUM(amount) FILTER (WHERE category = 'purchase_order') AS expenditure_total,
SUM(amount) FILTER (WHERE category = 'invoice') AS invoice_total
FROM (
SELECT
parent_id AS id,
category,
SUM(amount) AS amount
FROM cte
GROUP BY (parent_id, category)
) s
GROUP BY id
)
SELECT
*,
(invoice_total/expenditure_total - 1) * 100 AS percent
FROM
cte2
The first CTE joins the two tables. The COALESCE() function copies the id into parent_id when the record has none (that is, when category = 'purchase_order'). This makes it possible to do one single GROUP BY on this id and the category.
This is done within the second CTE (the innermost subquery). [Btw: I chose the CTE variant because I find it much more readable. You could of course do all steps as subqueries instead.] This grouping sums up the amounts per category for each (parent_)id.
The outer subquery does a pivot. It moves the per-category records into the columns of your expected result with the help of GROUP BY and the FILTER clause (have a look at this step in the fiddle to understand it). Don't worry about the SUM() function here: because of the GROUP BY an aggregate function is required, but it changes nothing, since the grouping has already been done.
Last step is calculating the percent value out of the pivoted table.
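The same pivot can be sketched in a single grouping step and then filtered by the threshold. The example below is an illustration under stated assumptions: it runs on SQLite through Python's sqlite3 module, it uses SUM(CASE ...) in place of the FILTER clause, and order 99 with its invoice 100 and their item rows are made-up records whose totals were chosen to match the 10000.00 / 11100.00 row of the desired output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenditures (id INTEGER, category TEXT, parent_id INTEGER)")
conn.execute("CREATE TABLE expenditures_items (id INTEGER, amount REAL, expenditure_id INTEGER)")
conn.executemany(
    "INSERT INTO expenditures VALUES (?, ?, ?)",
    [
        (17, "purchase_order", None), (46, "invoice", 17), (88, "invoice", 17),
        (99, "purchase_order", None), (100, "invoice", 99),  # hypothetical rows
    ],
)
conn.executemany(
    "INSERT INTO expenditures_items VALUES (?, ?, ?)",
    [
        (1, 1000.0, 17), (2, 2000.0, 17),
        (23, 500.0, 46), (24, 1000.0, 46), (78, 550.0, 88), (79, 1100.0, 88),
        (200, 10000.0, 99), (201, 11100.0, 100),  # hypothetical rows
    ],
)

# COALESCE maps each invoice onto its order id, so one GROUP BY covers both.
rows = conn.execute(
    """
    WITH totals AS (
        SELECT
            COALESCE(e.parent_id, e.id) AS order_id,
            SUM(CASE WHEN e.category = 'purchase_order' THEN ei.amount ELSE 0 END)
                AS expenditure_total,
            SUM(CASE WHEN e.category = 'invoice' THEN ei.amount ELSE 0 END)
                AS invoice_total
        FROM expenditures e
        JOIN expenditures_items ei ON ei.expenditure_id = e.id
        GROUP BY COALESCE(e.parent_id, e.id)
    )
    SELECT
        order_id,
        expenditure_total,
        invoice_total,
        ROUND((invoice_total - expenditure_total) * 100.0 / expenditure_total, 2)
            AS percent
    FROM totals
    ORDER BY order_id
    """
).fetchall()

# Keep only the orders whose invoices exceed the order total by more than 10%.
over_threshold = [row[0] for row in rows if row[3] > 10]

print(rows)            # [(17, 3000.0, 3150.0, 5.0), (99, 10000.0, 11100.0, 11.0)]
print(over_threshold)  # [99]
```

Orders without any invoice would show an invoice_total of 0 here; an extra condition such as invoice_total > 0 would drop them.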

Postgres, Rails and selecting columns that are not in group clause

I have the following query in which I want to group by treatment_selections.treatment_id and select the treatments.name column to be called:
@search = Trial.joins(:quality_datum, treatment_selections: :treatment)
.select('DISTINCT ON (treatment_selections.treatment_id) treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield')
.where("EXTRACT(year from season_year) BETWEEN #{params[:start_year]} AND #{params[:end_year]}")
I get the dreaded error:
PG::GroupingError: ERROR: column "treatment_selections.treatment_id" must appear in the GROUP BY clause or be used in an aggregate function
So I switched to the following query:
@search = Trial.joins(:quality_datum, treatment_selections: :treatment)
.select('treatments.name, treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield')
.where("EXTRACT(year from season_year) BETWEEN #{params[:start_year]} AND #{params[:end_year]}")
.group('treatment_selections.treatment_id')
I know this won't work because treatments.name is not referenced in the group clause. But I figured the first version should have worked, as I'm not grouping by anything there. I understand that columns wrapped in aggregates such as AVG and SUM do not need to appear in the group clause, but what about columns that are not used in any aggregate function?
I have seen that nesting queries is a possible way of doing what I'm after, but I'm unsure of how best to implement this using the above query. Hoping someone could help me out here.
Log
SELECT treatment_selections.treatment_id, treatment.name, AVG(quality_data.yield) as yield FROM "trials" INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id" INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id" INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id" INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id" WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018) GROUP BY treatment_selections.treatment_id)
Selecting multiple columns (without aggregation) and using aggregate functions together won't be possible, unless you group by the selected columns - otherwise there is no way to determine how the average should be calculated (entire data set vs grouped by something). You could do this -
@search = Trial.joins(:quality_datum, treatment_selections: :treatment)
.select('treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield')
.where("EXTRACT(year from season_year) BETWEEN ? AND ?", params[:start_year], params[:end_year])
.group('treatment_selections.treatment_id, treatments.name')
Although this might not work well for your use case if one treatments.id can be associated with multiple treatments.name values.
I am not an expert on Rails, but let's analyze the logged query:
SELECT treatment_selections.treatment_id, treatment.name, AVG(quality_data.yield) as yield
FROM "trials"
INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id"
INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id"
INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id"
INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id"
WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018)
GROUP BY treatment_selections.treatment_id
Maybe you are relying on the DISTINCT ON clause to make this work without specifying both columns. But as you can see in the log, it is not being translated into SQL.
SELECT [missing DISTINCT ON(treatment_selections.treatment_id)] treatment_selections.treatment_id, treatment.name, AVG(quality_data.yield) as yield
FROM "trials"
INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id"
INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id"
INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id"
INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id"
WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018)
GROUP BY treatment_selections.treatment_id
But even if you managed to force Rails to implement DISTINCT ON, you might not get your intended result because DISTINCT ON should return only one row per treatment_id.
The standard SQL way is to specify both columns in the GROUP BY clause.
If it is the case that treatment_id has a 1:1 relationship to treatment_name, then if you run the query without the AVG function (and without DISTINCT ON), the data would look similar to:
| treatment_id | name | yield |
------------------------------------------------------
| 1 | treatment 1 | 0.50 |
| 1 | treatment 1 | 0.45 |
| 2 | treatment 2 | 0.65 |
| 2 | treatment 2 | 0.66 |
| 3 | treatment 3 | 0.85 |
Now to use the average function you must aggregate by (both) treatment_id and treatment_name.
The reason you must specify both is that the database manager assumes the columns in the resulting data set are unrelated to each other. So, aggregating by both columns
SELECT treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield
FROM "trials"
INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id"
INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id"
INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id"
INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id"
WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018)
GROUP BY treatment_selections.treatment_id, treatments.name
will give you the following result:
| treatment_id | name | AVG(yield) |
------------------------------------------------------------
| 1 | treatment 1 | 0.475 |
| 2 | treatment 2 | 0.655 |
| 3 | treatment 3 | 0.85 |
To understand this better, suppose the data in the first two columns were not related, for example:
| year | name | yield |
-----------------------------------------------
| 2000 | treatment 1 | 0.1 |
| 2000 | treatment 1 | 0.2 |
| 2000 | treatment 2 | 0.3 |
| 2000 | treatment 3 | 0.4 |
| 2001 | treatment 2 | 0.5 |
| 2001 | treatment 3 | 0.6 |
| 2002 | treatment 3 | 0.7 |
you must still group by year and name and, in this case, the average would only be taken over rows where year and name are the same (note that it is not possible to do otherwise), resulting in:
| year | name | AVG(yield) |
---------------------------------------------------
| 2000 | treatment 1 | 0.15 |
| 2000 | treatment 2 | 0.3 |
| 2000 | treatment 3 | 0.4 |
| 2001 | treatment 2 | 0.5 |
| 2001 | treatment 3 | 0.6 |
| 2002 | treatment 3 | 0.7 |
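The grouping rule described above can be reproduced on the first sample table. This is a sketch only: it uses SQLite through Python's sqlite3 module instead of PostgreSQL, and the flat table name trial_yields is a hypothetical stand-in for the joined result of the real query.

```python
import sqlite3

# Flat stand-in for the joined, un-aggregated rows shown in the answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trial_yields (treatment_id INTEGER, name TEXT, yield REAL)")
conn.executemany(
    "INSERT INTO trial_yields VALUES (?, ?, ?)",
    [
        (1, "treatment 1", 0.50), (1, "treatment 1", 0.45),
        (2, "treatment 2", 0.65), (2, "treatment 2", 0.66),
        (3, "treatment 3", 0.85),
    ],
)

# Every non-aggregated column in the SELECT list also appears in GROUP BY.
averages = conn.execute(
    """
    SELECT treatment_id, name, ROUND(AVG(yield), 3) AS avg_yield
    FROM trial_yields
    GROUP BY treatment_id, name
    ORDER BY treatment_id
    """
).fetchall()

for treatment_id, name, avg_yield in averages:
    print(treatment_id, name, avg_yield)
```

Because treatment_id and name are 1:1 in this data, grouping by both produces the same three groups as grouping by treatment_id alone would.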

Left Joins that link to multiple rows only returning one

I'm trying to join two tables (call them table1 and table2) but only return one entry for each match. In table2, there is a column called 'current' that is either 'y', 'n', or 'null'. I have left joined the two tables and added a where clause to get the 'y' and 'null' instances; those are easy. I need help with the table1 rows that join only to 'n' rows: for those I want a single row back with 'none' or 'null'. Here is an example:
table1
ID
1
2
3
table2
ID | table1ID | current
1 | 1 | y
2 | 2 | null
3 | 3 | n
4 | 3 | n
5 | 3 | n
My current query joins on table1.ID=table2.table1ID and then has a where clause (where table2.current = 'y' or table2.current = 'null') but that doesn't work when there is no 'y' and the value isn't 'null'.
Can someone come up with a query that would join the table like I have but get me all 3 records from table1 like this?
Query Return
ID | table2ID | current
1 | 1 | y
2 | null | null
3 | 3 | null or none
First off, I'm assuming the "null" values are actually strings and not the DB value NULL.
If so, the query below should work (notice the inclusion of the where criteria INSIDE the ON sub-clause):
select
table1.ID as ID
,table2.ID as table2ID
,table2.current
from table1 left outer join table2
on (table2.table1ID = table1.ID and
(table2.current in ('y','null')))
If this does work, I would STRONGLY recommend changing the "null" string value to something else as it is entirely misleading... you or some other developer will lose time debugging this in the future.
If "null" actually refers to the null value, then change the above query to:
select
table1.ID as ID
,table2.ID as table2ID
,table2.current
from table1 left outer join table2
on (table2.table1ID = table1.ID and
(table2.current = 'y' or table2.current is null))
you need to decide which of the three rows from table2 with table1id = 3 you want:
3 | 3 | n
4 | 3 | n
5 | 3 | n
what's the criterion?
select t1.id
, t2.id as table2id
, case when t2.count_current > 0 then
t2.count_current
else
null
end as current
from table1 t1
left outer join
(
-- one row per table1id: the highest table2 id and the number of 'y' rows
select max(id) as id
, table1id
, sum(case when current = 'y' then 1 else 0 end) as count_current
from table2
group by table1id
) t2
on t1.id = t2.table1id
although, as justsomebody has pointed out, this may not work as you expect once you have multiple rows with 'y' in your table 2.
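The effect of putting the filter inside the ON clause, as in the first answer, can be seen on the sample data. This is a sketch in SQLite through Python's sqlite3 module, assuming the "null" entry in table2 is a real NULL; the column name current is quoted because it can be treated as a keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (ID INTEGER PRIMARY KEY)")
conn.execute(
    'CREATE TABLE table2 (ID INTEGER PRIMARY KEY, table1ID INTEGER, "current" TEXT)'
)
conn.executemany("INSERT INTO table1 VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?)",
    [(1, 1, "y"), (2, 2, None), (3, 3, "n"), (4, 3, "n"), (5, 3, "n")],
)

# The filter lives in ON, so table1 rows with no qualifying table2 row
# are kept and padded with NULLs instead of being removed by a WHERE.
result = conn.execute(
    """
    SELECT t1.ID, t2.ID AS table2ID, t2."current"
    FROM table1 t1
    LEFT OUTER JOIN table2 t2
        ON t2.table1ID = t1.ID
       AND (t2."current" = 'y' OR t2."current" IS NULL)
    ORDER BY t1.ID
    """
).fetchall()

print(result)  # [(1, 1, 'y'), (2, 2, None), (3, None, None)]
```

Table1 row 3 comes back exactly once, with NULLs for the table2 columns, because none of its three 'n' rows satisfies the ON condition.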
