I have the following query, in which I want to group by treatment_selections.treatment_id and also select the treatments.name column:
@search = Trial.joins(:quality_datum, treatment_selections: :treatment)
.select('DISTINCT ON (treatment_selections.treatment_id) treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield')
.where("EXTRACT(year from season_year) BETWEEN #{params[:start_year]} AND #{params[:end_year]}")
I get the dreaded error:
PG::GroupingError: ERROR: column "treatment_selections.treatment_id" must appear in the GROUP BY clause or be used in an aggregate function
So I switched to the following query:
@search = Trial.joins(:quality_datum, treatment_selections: :treatment)
.select('treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield')
.where("EXTRACT(year from season_year) BETWEEN #{params[:start_year]} AND #{params[:end_year]}")
.group('treatment_selections.treatment_id')
Which I know won't work because treatments.name isn't referenced in the group clause. But I figured the top method should have worked as I'm not grouping by anything. I understand that columns wrapped in aggregate functions such as AVG and SUM don't need to be referenced in the group clause, but what about columns that aren't inside any aggregate function?
I have seen that nesting queries is a possible way of doing what I'm after, but I'm unsure of how best to implement this using the above query. Hoping someone could help me out here.
Log
SELECT treatment_selections.treatment_id, treatment.name, AVG(quality_data.yield) as yield FROM "trials" INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id" INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id" INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id" INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id" WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018) GROUP BY treatment_selections.treatment_id)
Selecting multiple columns (without aggregation) together with aggregate functions isn't possible unless you group by the selected columns; otherwise there is no way to determine how the average should be calculated (over the entire data set vs. grouped by something). You could do this:
@search = Trial.joins(:quality_datum, treatment_selections: :treatment)
.select('treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield')
.where("EXTRACT(year from season_year) BETWEEN ? AND ?", params[:start_year], params[:end_year])
.group('treatment_selections.treatment_id, treatments.name')
Although this might not work well for your use case if one treatments.id can be associated with multiple values of treatments.name.
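For the nested-query approach mentioned in the question, a possible sketch (untested, assuming the table and column names from the logged query and that treatments.id is the primary key) is to aggregate by treatment_id first and only then join the names back on, so nothing besides the key has to appear in the GROUP BY:
-- sketch only: aggregate per treatment first, then attach the treatment names
SELECT averaged.treatment_id, treatments.name, averaged.yield
FROM (
    SELECT treatment_selections.treatment_id,
           AVG(quality_data.yield) AS yield
    FROM trials
    INNER JOIN treatment_selections ON treatment_selections.trial_id = trials.id
    INNER JOIN quality_data ON quality_data.treatment_selection_id = treatment_selections.id
    WHERE EXTRACT(year FROM season_year) BETWEEN 2018 AND 2018
    GROUP BY treatment_selections.treatment_id
) averaged
INNER JOIN treatments ON treatments.id = averaged.treatment_id;
In Rails this could be plugged in via find_by_sql or from, but the SQL above is the core idea.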
I am no expert on Rails, but let's analyze the logged query:
SELECT treatment_selections.treatment_id, treatment.name, AVG(quality_data.yield) as yield
FROM "trials"
INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id"
INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id"
INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id"
INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id"
WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018)
GROUP BY treatment_selections.treatment_id
Maybe you are relying on the DISTINCT ON clause to make this work without specifying both columns. But as you can see in the log, it is not being translated into SQL:
SELECT [missing DISTINCT ON(treatment_selections.treatment_id)] treatment_selections.treatment_id, treatment.name, AVG(quality_data.yield) as yield
FROM "trials"
INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id"
INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id"
INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id"
INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id"
WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018)
GROUP BY treatment_selections.treatment_id
But even if you managed to force Rails to emit DISTINCT ON, you might not get your intended result, because DISTINCT ON returns only one row per treatment_id.
The standard SQL way is to specify both columns in the GROUP BY clause of the aggregation.
If treatment_id has a 1:1 relationship to name, then running the query without the AVG function (and without DISTINCT ON) would return data similar to:
| treatment_id | name        | yield |
|--------------|-------------|-------|
| 1            | treatment 1 | 0.50  |
| 1            | treatment 1 | 0.45  |
| 2            | treatment 2 | 0.65  |
| 2            | treatment 2 | 0.66  |
| 3            | treatment 3 | 0.85  |
Now, to use the average function you must group by both treatment_id and name.
The reason you must specify both is that the database assumes the columns in the result set are not related to each other. So, grouping by both columns
SELECT treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield
FROM "trials"
INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id"
INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id"
INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id"
INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id"
WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018)
GROUP BY treatment_selections.treatment_id, treatments.name
will give you the following result:
| treatment_id | name        | AVG(yield) |
|--------------|-------------|------------|
| 1            | treatment 1 | 0.475      |
| 2            | treatment 2 | 0.655      |
| 3            | treatment 3 | 0.85       |
To understand this better, suppose the data in the first two columns was not related; for example:
| year | name        | yield |
|------|-------------|-------|
| 2000 | treatment 1 | 0.1   |
| 2000 | treatment 1 | 0.2   |
| 2000 | treatment 2 | 0.3   |
| 2000 | treatment 3 | 0.4   |
| 2001 | treatment 2 | 0.5   |
| 2001 | treatment 3 | 0.6   |
| 2002 | treatment 3 | 0.7   |
You would still group by year and name, and the average would only be computed over rows where both year and name are the same (it is not possible to do otherwise), giving the result below (a sketch of the corresponding query follows the table):
| year | name        | AVG(yield) |
|------|-------------|------------|
| 2000 | treatment 1 | 0.15       |
| 2000 | treatment 2 | 0.3        |
| 2000 | treatment 3 | 0.4        |
| 2001 | treatment 2 | 0.5        |
| 2001 | treatment 3 | 0.6        |
| 2002 | treatment 3 | 0.7        |
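For completeness, a sketch of the grouping just described, using a hypothetical table name since the example data above isn't tied to a real table:
-- "results" is a hypothetical table holding the year/name/yield rows shown above
SELECT year, name, AVG(yield) AS avg_yield
FROM results
GROUP BY year, name;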
Related
I'm running macOS 11.6, LibreOffice 7.2.2.2, and HSQLDB (my understanding is this is v1.8, but I don't know how to verify).
I'm a newbie to SQL, and I'm trying to write a DB to maintain a club membership roster. I'm trying to find everyone in the DB to whom renewal letters should be sent. The quirk is, if a person has never paid in the past, they should be sent a renewal letter. Old members who haven't renewed recently don't get a renewal, and obviously, each individual should only get one letter. I've created a toy example to display the problem I'm having...
Members table:
Key (Integer, Primary key, Autoincrement)
Name (Varchar)
+-----+----------+
| Key | Name |
+-----+----------+
| 0 | Abby |
| 1 | Bob |
| 2 | Dave |
| 3 | Ellen |
+-----+----------+
Payments table:
Key (Integer, Primary Key, autoincrement)
MemberKey (Integer, foreign key to Member table)
Payment Date (Date)
+-----+-----------+--------------+
| Key | MemberKey | Payment Date |
+-----+-----------+--------------+
| 0 | 0 | 2020-05-23 |
| 1 | 0 | 2021-06-12 |
| 2 | 1 | 2016-05-28 |
| 3 | 2 | 2020-07-02 |
+-----+-----------+--------------+
The only way I've found to include everyone is with a LEFT JOIN. The only way I've found to pick the most recent payment is with MAX. The following query produces a list of everyone's most recent payments, including people who've never paid:
SELECT "Members"."Key", "Members"."Name", MAX( "Payments"."Payment Date" ) AS "Last Payment"
FROM { oj "Members" LEFT OUTER JOIN "Payments" ON "Members"."Key" = "Payments"."MemberKey" }
GROUP BY "Members"."Key", "Members"."Name"
It returns the result below, which includes all members only once (Abby has 2 payments but only appears once with the most recent payment). Unfortunately it still includes people like Bob who've been out of the club so long that we don't want to send them a renewal notice.
+-----+----------+--------------+
| Key | Name | Last Payment |
+-----+----------+--------------+
| 0 | Abby | 2021-06-12 |
| 1 | Bob | 2016-05-28 |
| 2 | Dave | 2020-07-02 |
| 3 | Ellen | |
+-----+----------+--------------+
Where I hit a wall is when I try to perform any kind of conditional operation on the Last Payment, to determine whether it's recent enough to include in the list of renewal notices. For instance, in HSQLDB, the query below returns the error, "The data content could not be loaded. Not a condition." The only change in this query from the 1st one is the addition of the WHERE clause.
SELECT "Members"."Key", "Members"."Name", MAX( "Payments"."Payment Date" ) AS "Last Payment"
FROM { oj "Members" LEFT OUTER JOIN "Payments" ON "Members"."Key" = "Payments"."MemberKey" }
WHERE "Last Payment" >= '2020-01-01'
GROUP BY "Members"."Key", "Members"."Name"
The desired output should look like this:
+-----+----------+--------------+
| Key | Name | Last Payment |
+-----+----------+--------------+
| 0 | Abby | 2021-06-12 |
| 2 | Dave | 2020-07-02 |
| 3 | Ellen | |
+-----+----------+--------------+
I've been digging around the web trying anything that looks relevant. I've tried HAVING clauses -- I can make them work with a COUNT(*) function, but I can't make them work with the MAX function. I've tried using my 1st query as a subquery and applying the WHERE clause on "Last Payment" in the main query. I've tried solutions people say work in MySQL, but I can't get them to work in HSQLDB. I tried using the 1st query as a View and writing a query against the View. I've tried a dozen other things I don't even remember. Everything past the 1st query above throws an error. I wanted to include my toy DB, but can't find a way to attach it to the post.
Can anyone help please?
This worked for me.
SELECT "Members"."Key", "Members"."Name", MAX( "Payments"."Payment Date" ) AS "Last Payment"
FROM {oj "Members" LEFT OUTER JOIN "Payments" ON "Members"."Key" = "Payments"."MemberKey"
WHERE "Payments"."Payment Date" >= '2020-01-01'
OR "Payments"."Payment Date" IS NULL}
GROUP BY "Members"."Key", "Members"."Name"
This works as well.
SELECT "Members"."Key", "Members"."Name", MAX( "Payments"."Payment Date" ) AS "Last Payment"
FROM { oj "Members" LEFT OUTER JOIN "Payments" ON "Members"."Key" = "Payments"."MemberKey" }
WHERE "Payments"."Payment Date" >= '2020-01-01'
OR "Payments"."Payment Date" IS NULL
GROUP BY "Members"."Key", "Members"."Name"
Perhaps the problem you were having is that "Last Payment" is only a column alias, not the actual name of any column, so it cannot be referenced in a WHERE condition.
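Another option (an untested sketch against the same tables and { oj } escape) is to leave the join untouched and filter on the aggregate itself with a HAVING clause; the IS NULL branch keeps members who have never paid, since MAX over zero payments is NULL:
-- untested sketch: filter on the aggregate instead of the joined rows
SELECT "Members"."Key", "Members"."Name", MAX( "Payments"."Payment Date" ) AS "Last Payment"
FROM { oj "Members" LEFT OUTER JOIN "Payments" ON "Members"."Key" = "Payments"."MemberKey" }
GROUP BY "Members"."Key", "Members"."Name"
HAVING MAX( "Payments"."Payment Date" ) >= '2020-01-01'
    OR MAX( "Payments"."Payment Date" ) IS NULL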
Rails v. 5.2.4
ActiveRecord v5.2.4.3
I have a Rails app with a MySQL database, and my app has a Skill model and a SkillAdjacency model. The SkillAdjacency model has the following attributes:
requested_skill_id, table_name: 'Skill'
adjacent_skill_id, table_name: 'Skill'
score, integer
SkillAdjacencies are used to determine how "similar" two instances of Skill are to each other.
One of the app's constraints is that you can't create more than one instance of SkillAdjacency for each combination of requested_skill and adjacent_skill, and I plan to enforce this both with ActiveModel validations and with a composite index which employs a uniqueness constraint. So far I have the following:
add_index :skill_adjacencies, [:requested_skill_id, :adjacent_skill_id], unique: true, name: 'index_adjacencies_on_requested_then_adjacent', using: :btree
However, I know that the order in which the composite columns are declared is important, so I'm considering adding this 2nd composite index to account for the other possible order:
add_index :skill_adjacencies, [:adjacent_skill_id, :requested_skill_id], unique: true, name: 'index_adjacencies_on_adjacent_then_requested', using: :btree
But because writing to an index isn't free, I only want to add the 2nd index if it will actually result in a performance benefit. The problem is, whether or not this 2nd index will be beneficial depends on whether ActiveRecord will start with adjacent_skill_id or requested_skill_id when choosing a composite index to use.
How can I determine what order ActiveRecord uses? Does it just use the same order that's specified in the query? For example, if I query SkillAdjacency.where(requested_skill: Skill.last, adjacent_skill: Skill.first), will it always search for a composite index composed of requested_skill 1st and adjacent_skill 2nd? If that's the case, should I cover all my bases by creating that additional composite index?
Alternately, is there some under-the-hood magic which determines if the relevant composite index exists regardless of the order provided in the query?
EDIT:
I ran EXPLAIN and saw the following:
irb(main):013:0> SkillAdjacency.where(requested_skill_id: 1, adjacent_skill_id: 200).explain
SkillAdjacency Load (0.3ms) SELECT `skill_adjacencies`.* FROM `skill_adjacencies` WHERE `skill_adjacencies`.`requested_skill_id` = 1 AND `skill_adjacencies`.`adjacent_skill_id` = 200
=> EXPLAIN for: SELECT `skill_adjacencies`.* FROM `skill_adjacencies` WHERE `skill_adjacencies`.`requested_skill_id` = 1 AND `skill_adjacencies`.`adjacent_skill_id` = 200
+----+-------------+-------------------+------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+-------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+-------------+------+----------+-------+
| 1 | SIMPLE | skill_adjacencies | NULL | const | index_adjacencies_on_requested_then_adjacent,index_adjacencies_on_adjacent_then_requested,index_skill_adjacencies_on_requested_skill_id,index_skill_adjacencies_on_adjacent_skill_id | index_adjacencies_on_requested_then_adjacent | 10 | const,const | 1 | 100.0 | NULL |
+----+-------------+-------------------+------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+-------------+------+----------+-------+
1 row in set (0.00 sec)
irb(main):014:0> SkillAdjacency.where(adjacent_skill_id: 200, requested_skill: 1).explain
SkillAdjacency Load (0.3ms) SELECT `skill_adjacencies`.* FROM `skill_adjacencies` WHERE `skill_adjacencies`.`adjacent_skill_id` = 200 AND `skill_adjacencies`.`requested_skill_id` = 1
=> EXPLAIN for: SELECT `skill_adjacencies`.* FROM `skill_adjacencies` WHERE `skill_adjacencies`.`adjacent_skill_id` = 200 AND `skill_adjacencies`.`requested_skill_id` = 1
+----+-------------+-------------------+------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+-------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+-------------+------+----------+-------+
| 1 | SIMPLE | skill_adjacencies | NULL | const | index_adjacencies_on_requested_then_adjacent,index_adjacencies_on_adjacent_then_requested,index_skill_adjacencies_on_requested_skill_id,index_skill_adjacencies_on_adjacent_skill_id | index_adjacencies_on_requested_then_adjacent | 10 | const,const | 1 | 100.0 | NULL |
+----+-------------+-------------------+------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+-------------+------+----------+-------+
1 row in set (0.00 sec)
In both cases, I see that the value in the key column is index_adjacencies_on_requested_then_adjacent, despite each query passing in a different order for the query params. Can I assume this means the order of those params doesn't matter?
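Not an authoritative answer, but one way to probe this further is to force MySQL onto the second composite index and compare the plans; this is only a sketch reusing the index name from the migration above:
-- sketch: ask the optimizer to use the other composite index and compare the EXPLAIN output
EXPLAIN SELECT *
FROM skill_adjacencies FORCE INDEX (index_adjacencies_on_adjacent_then_requested)
WHERE adjacent_skill_id = 200 AND requested_skill_id = 1;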
I have two subqueries that I'd like to join only by the date range between the open and close dates from the first table.
First table example:
| id_original | open_datetime       | close_datetime      |
|-------------|---------------------|---------------------|
| 1           | 2019-01-01 10:00:02 | 2019-01-02 11:00:21 |
| 2           | 2019-01-01 10:05:52 | 2019-01-05 16:45:12 |
| 3           | 2019-01-03 00:00:43 | 2019-01-03 23:12:44 |
Second table example:
| category | all other columns... | open_date           |
|----------|----------------------|---------------------|
| A        | ...                  | 2019-01-01 11:00:00 |
| B        | ...                  | 2019-01-02 19:10:10 |
| C        | ...                  | 2019-01-03 08:23:45 |
| D        | ...                  | 2019-01-04 18:10:53 |
Desired output:
| id_original | category | all other columns... | open_date           |
|-------------|----------|----------------------|---------------------|
| 1           | A        | ...                  | 2019-01-01 11:00:00 |
| 2           | A        | ...                  | 2019-01-01 11:00:00 |
| 2           | B        | ...                  | 2019-01-02 19:10:10 |
| 2           | C        | ...                  | 2019-01-03 08:23:45 |
| 2           | D        | ...                  | 2019-01-04 18:10:53 |
| 3           | C        | ...                  | 2019-01-03 08:23:45 |
This is my code:
SELECT *
FROM (
SELECT id, open_datetime, close_datetime
FROM table1
WHERE id IN (list_of_ids)
) t1
LEFT JOIN (
SELECT *
FROM table2
WHERE other_conditions
) t2 ON t2.open_date >= t1.open_datetime AND t2.open_date <= t1.close_datetime
I know that Hive SQL doesn't support inequality conditions in a JOIN. But how should I approach this matter?
Note: The join I need is exclusively for dates, there is no equal key from t1 and t2 that I can use to join them.
Thanks!
Move the join condition to the WHERE clause. In this case the LEFT JOIN is transformed into a CROSS join, because you do not have any other join condition, and a join without conditions is a CROSS join. After the cross join, filter rows in the WHERE clause. A CROSS join may cause serious performance issues if it is not possible to filter rows or join on some other key to avoid the cross product. If one of the tables is small enough to fit into memory, the CROSS join will be executed as a map-join, which also helps performance.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=512000000; --try to set it bigger and see if map-join works
--setting too big value may cause OOM exception
SELECT *
FROM (
SELECT id, open_datetime, close_datetime
FROM table1
WHERE id IN (list_of_ids)
) t1
CROSS JOIN
(
SELECT *
FROM table2
WHERE other_conditions
) t2
WHERE (t2.open_date >= t1.open_datetime AND t2.open_date <= t1.close_datetime)
OR t2.category is NULL --to allow absence of t2 like in LEFT join
;
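As a variation (sketch only), the map-join can also be requested with an explicit hint instead of relying on hive.auto.convert.join, assuming t2 is the small table; whether the hint is honoured depends on your Hive configuration:
-- sketch: explicit map-join hint; behaviour depends on Hive version and settings
SELECT /*+ MAPJOIN(t2) */ *
FROM (
    SELECT id, open_datetime, close_datetime
    FROM table1
    WHERE id IN (list_of_ids)
) t1
CROSS JOIN (
    SELECT *
    FROM table2
    WHERE other_conditions
) t2
WHERE (t2.open_date >= t1.open_datetime AND t2.open_date <= t1.close_datetime)
   OR t2.category IS NULL;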
Say you have the following users table on PostgreSQL:
id | group_id | name    | age
---|----------|---------|----
1  | 1        | adam    | 10
2  | 1        | ben     | 11
3  | 1        | charlie | 12  <-
4  | 2        | donnie  | 20
5  | 2        | ewan    | 21  <-
6  | 3        | fred    | 30  <-
How can I query all columns only from the oldest user per group_id (those marked with an arrow)?
I've tried with group by, but keep hitting "users.id" must appear in the GROUP BY clause.
(Note: I have to work the query into a Rails AR model scope.)
After some digging: you can use PostgreSQL's DISTINCT ON (col):
select distinct on (users.group_id) users.*
from users
order by users.group_id, users.age desc;
-- you might want to add extra column in ordering in case 2 users have the same age for same group_id
Translated in Rails, it would be:
User
.select('DISTINCT ON (users.group_id) users.*')
.order('users.group_id, users.age DESC')
Some doc about DISTINCT ON: https://www.postgresql.org/docs/9.3/sql-select.html#SQL-DISTINCT
Working example: https://www.db-fiddle.com/f/t4jeW4Sy91oxEfjMKYJpB1/0
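As the comment in the SQL above hints, adding a tiebreaker column keeps the result deterministic when two users in a group share the same age; a small sketch:
-- id (or any stable column) acts as the tiebreaker after age
SELECT DISTINCT ON (users.group_id) users.*
FROM users
ORDER BY users.group_id, users.age DESC, users.id;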
You could use ROW_NUMBER/RANK(if ties are possible) windowed functions:
SELECT *
FROM (SELECT *,ROW_NUMBER() OVER(PARTITION BY group_id ORDER BY age DESC) AS rn
FROM tab) s
WHERE s.rn = 1;
You can use a subquery with an aggregated result in a join:
select m.*
from users m
inner join (
select group_id, max(age) max_age
from users
group by group_id
) AS t on (t.group_id = m.group_id and t.max_age = m.age)
I've got a table like this:
table: searches
+------------------------------+
| id | address | date |
+------------------------------+
| 1 | 123 foo st | 03/01/13 |
| 2 | 123 foo st | 03/02/13 |
| 3 | 456 foo st | 03/02/13 |
| 4 | 567 foo st | 03/01/13 |
| 5 | 456 foo st | 03/01/13 |
| 6 | 567 foo st | 03/01/13 |
+------------------------------+
And want a result set like this:
+------------------------------+
| id | address | date |
+------------------------------+
| 2 | 123 foo st | 03/02/13 |
| 3 | 456 foo st | 03/02/13 |
| 4 | 567 foo st | 03/01/13 |
+------------------------------+
But ActiveRecord seems unable to achieve this result. Here's what I'm trying:
Model has a 'most_recent' scope: scope :most_recent, order('date_searched DESC')
Model.most_recent.uniq returns the full set (SELECT DISTINCT "searches".* FROM "searches" ORDER BY date DESC) -- obviously the query is not going to do what I want, but neither is selecting only one column. I need all columns, but only rows where the address is unique in the result set.
I could do something like Model.select('distinct(address), date, id'), but that feels...wrong.
You could do a
select max(id), address, max(date) as latest
from searches
group by address
order by latest desc
According to sqlfiddle that does exactly what I think you want.
It's not quite the same as your required output, which doesn't seem to care about which ID is returned. Still, the query needs to specify something, which is done here by the MAX aggregate function.
I don't think you'll have any luck with ActiveRecord's autogenerated query methods for this case. So just add your own query method using that SQL to your model class. It's completely standard SQL that'll also run on basically any other RDBMS.
Edit: One big weakness of the query is that it doesn't necessarily return actual records. If the highest ID for a given address doesn't correlate with the highest date for that address, the resulting "record" will be different from the one actually stored in the DB. Depending on the use case that might or might not matter. For MySQL, simply changing max(id) to id would fix that problem, but IIRC Oracle has a problem with that.
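If the actual stored rows are needed rather than a mix of max() values, a common pattern (a sketch, assuming the latest date is unique per address) is to join back on the per-address maximum date:
-- return the real rows whose date equals the per-address maximum
SELECT s.*
FROM searches s
INNER JOIN (
    SELECT address, MAX(date) AS max_date
    FROM searches
    GROUP BY address
) latest ON latest.address = s.address AND latest.max_date = s.date;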
To show unique addresses:
Searches.group(:address)
Then you can select columns if you want:
Searches.group(:address).select('id,date')