How to join exclusively by date range in Hive SQL?

I have two subqueries that I'd like to join solely on a date range: a row from the second table should match when its open_date falls between the open and close datetimes of a row in the first table.
First table example:
| id_original | open_datetime | close_datetime |
|-------------|-------------------|-------------------|
| 1 |2019-01-01 10:00:02|2019-01-02 11:00:21|
| 2 |2019-01-01 10:05:52|2019-01-05 16:45:12|
| 3 |2019-01-03 00:00:43|2019-01-03 23:12:44|
Second table example:
| category | all other columns...| open_date |
|----------|---------------------|-------------------|
| A | ... |2019-01-01 11:00:00|
| B | ... |2019-01-02 19:10:10|
| C | ... |2019-01-03 08:23:45|
| D | ... |2019-01-04 18:10:53|
Desired output:
| id_original | category | all other columns...| open_date |
|-------------|----------|---------------------|-------------------|
| 1 | A | ... |2019-01-01 11:00:00|
| 2 | A | ... |2019-01-01 11:00:00|
| 2 | B | ... |2019-01-02 19:10:10|
| 2 | C | ... |2019-01-03 08:23:45|
| 2 | D | ... |2019-01-04 18:10:53|
| 3 | C | ... |2019-01-03 08:23:45|
This is my code:
SELECT *
FROM (
SELECT id, open_datetime, close_datetime
FROM table1
WHERE id IN (list_of_ids)
) t1
LEFT JOIN (
SELECT *
FROM table2
WHERE other_conditions
) t2 ON t2.open_date >= t1.open_datetime AND t2.open_date <= t1.close_datetime
I know that Hive SQL doesn't support inequality as conditions for a JOIN. But how should I approach this matter?
Note: The join I need is exclusively for dates, there is no equal key from t1 and t2 that I can use to join them.
Thanks!

Move the join condition to the WHERE clause. With no other join condition, the LEFT JOIN becomes a CROSS JOIN (a join without conditions is a cross join), and the rows are filtered afterwards in the WHERE clause. Be aware that a CROSS JOIN can cause serious performance problems if you cannot pre-filter the rows or join on some other key to avoid the full cross product. If one of the tables is small enough to fit into memory, the CROSS JOIN will be executed as a map-join, which also helps performance.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=512000000; --try to set it bigger and see if map-join works
--setting too big value may cause OOM exception
SELECT *
FROM (
SELECT id, open_datetime, close_datetime
FROM table1
WHERE id IN (list_of_ids)
) t1
CROSS JOIN
(
SELECT *
FROM table2
WHERE other_conditions
) t2
WHERE t2.open_date >= t1.open_datetime AND t2.open_date <= t1.close_datetime
--note: t1 rows with no matching t2 row are dropped; a cross join followed by a filter behaves like an INNER join, not a LEFT join
;
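The cross-join-then-filter idea can be checked on the sample data from the question. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made so the example is runnable anywhere), stores the timestamps as ISO-8601 text, and returns only the matching pairs.

```python
import sqlite3

# SQLite stands in for Hive here; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, open_datetime TEXT, close_datetime TEXT);
    CREATE TABLE table2 (category TEXT, open_date TEXT);
    INSERT INTO table1 VALUES
        (1, '2019-01-01 10:00:02', '2019-01-02 11:00:21'),
        (2, '2019-01-01 10:05:52', '2019-01-05 16:45:12'),
        (3, '2019-01-03 00:00:43', '2019-01-03 23:12:44');
    INSERT INTO table2 VALUES
        ('A', '2019-01-01 11:00:00'),
        ('B', '2019-01-02 19:10:10'),
        ('C', '2019-01-03 08:23:45'),
        ('D', '2019-01-04 18:10:53');
""")

# Cross join first, then keep only the pairs whose open_date is inside the range.
# ISO-8601 timestamps stored as text compare correctly as plain strings.
rows = conn.execute("""
    SELECT t1.id, t2.category, t2.open_date
    FROM table1 t1
    CROSS JOIN table2 t2
    WHERE t2.open_date >= t1.open_datetime
      AND t2.open_date <= t1.close_datetime
    ORDER BY t1.id, t2.open_date
""").fetchall()

for row in rows:
    print(row)
```

The six rows printed correspond to the desired output in the question. On real Hive tables the cost of the cross product still applies, so pre-filter both sides as much as possible.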

Related

LibreOffice HSQLDB WHERE clause with LEFT JOIN and MAX?

I'm running macOS 11.6, LibreOffice 7.2.2.2, and HSQLDB (my understanding is that this is v1.8, but I don't know how to verify it).
I'm a newbie to SQL, and I'm trying to write a DB to maintain a club membership roster. I'm trying to find everyone in the DB to whom renewal letters should be sent. The quirk is, if a person has never paid in the past, they should be sent a renewal letter. Old members who haven't renewed recently don't get a renewal, and obviously, each individual should only get one letter. I've created a toy example to display the problem I'm having...
Members table:
Key (Integer, Primary key, Autoincrement)
Name (Varchar)
+-----+----------+
| Key | Name |
+-----+----------+
| 0 | Abby |
| 1 | Bob |
| 2 | Dave |
| 3 | Ellen |
+-----+----------+
Payments table:
Key (Integer, Primary Key, autoincrement)
MemberKey (Integer, foreign key to Member table)
Payment Date (Date)
+-----+-----------+--------------+
| Key | MemberKey | Payment Date |
+-----+-----------+--------------+
| 0 | 0 | 2020-05-23 |
| 1 | 0 | 2021-06-12 |
| 2 | 1 | 2016-05-28 |
| 3 | 2 | 2020-07-02 |
+-----+-----------+--------------+
The only way I've found to include everyone is with a LEFT JOIN. The only way I've found to pick the most recent payment is with MAX. The following query produces a list of everyone's most recent payments, including people who've never paid:
SELECT "Members"."Key", "Members"."Name", MAX( "Payments"."Payment Date" ) AS "Last Payment"
FROM { oj "Members" LEFT OUTER JOIN "Payments" ON "Members"."Key" = "Payments"."MemberKey" }
GROUP BY "Members"."Key", "Members"."Name"
It returns the result below, which includes all members only once (Abby has 2 payments but only appears once with the most recent payment). Unfortunately it still includes people like Bob who've been out of the club so long that we don't want to send them a renewal notice.
+-----+----------+--------------+
| Key | Name | Last Payment |
+-----+----------+--------------+
| 0 | Abby | 2021-06-12 |
| 1 | Bob | 2016-05-28 |
| 2 | Dave | 2020-07-02 |
| 3 | Ellen | |
+-----+----------+--------------+
Where I hit a wall is when I try to perform any kind of conditional operation on the Last Payment, to determine whether it's recent enough to include in the list of renewal notices. For instance, in HSQLDB, the query below returns the error, "The data content could not be loaded. Not a condition." The only change in this query from the 1st one is the addition of the WHERE clause.
SELECT "Members"."Key", "Members"."Name", MAX( "Payments"."Payment Date" ) AS "Last Payment"
FROM { oj "Members" LEFT OUTER JOIN "Payments" ON "Members"."Key" = "Payments"."MemberKey" }
WHERE "Last Payment" >= '2020-01-01'
GROUP BY "Members"."Key", "Members"."Name"
The desired output should look like this:
+-----+----------+--------------+
| Key | Name | Last Payment |
+-----+----------+--------------+
| 0 | Abby | 2021-06-12 |
| 2 | Dave | 2020-07-02 |
| 3 | Ellen | |
+-----+----------+--------------+
I've been digging around the web trying anything that looks relevant. I've tried "HAVING" clauses--I can make them work with a COUNT(*) function, but I can't make them work with a MAX(*) function. I've tried using my 1st query as a subquery, and applying the WHERE clause on "Last Payment" in the main query. I've tried solutions people say work in MySQL, but I can't get them to work in HSQLDB. I tried using the 1st query as a View, and writing a query against the View. I've tried a dozen other things I don't even remember. Everything past the 1st query above throws an error. I wanted to include my toy DB, but can't find a way to attach it to the post.
Can anyone help please?

This worked for me.
SELECT "Members"."Key", "Members"."Name", MAX( "Payments"."Payment Date" ) AS "Last Payment"
FROM {oj "Members" LEFT OUTER JOIN "Payments" ON "Members"."Key" = "Payments"."MemberKey"
WHERE "Payments"."Payment Date" >= '2020-01-01'
OR "Payments"."Payment Date" IS NULL}
GROUP BY "Members"."Key", "Members"."Name"
This works as well.
SELECT "Members"."Key", "Members"."Name", MAX( "Payments"."Payment Date" ) AS "Last Payment"
FROM { oj "Members" LEFT OUTER JOIN "Payments" ON "Members"."Key" = "Payments"."MemberKey" }
WHERE "Payments"."Payment Date" >= '2020-01-01'
OR "Payments"."Payment Date" IS NULL
GROUP BY "Members"."Key", "Members"."Name"
Perhaps the problem you were having is that "Last Payment" is only a column alias for the result, not the name of an actual column, so it cannot be referenced in the WHERE clause.
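For comparison, the same filter can be written against the aggregate itself with HAVING, repeating the MAX expression instead of the alias. This is a sketch run on SQLite through Python's sqlite3 module, not on HSQLDB; the asker reported that HAVING with MAX failed in their HSQLDB setup, so treat it as an illustration of the logic rather than a verified HSQLDB 1.8 query.

```python
import sqlite3

# SQLite stand-in; table and column names are copied from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "Members" ("Key" INTEGER PRIMARY KEY, "Name" TEXT);
    CREATE TABLE "Payments" (
        "Key" INTEGER PRIMARY KEY,
        "MemberKey" INTEGER,
        "Payment Date" TEXT
    );
    INSERT INTO "Members" VALUES (0, 'Abby'), (1, 'Bob'), (2, 'Dave'), (3, 'Ellen');
    INSERT INTO "Payments" VALUES
        (0, 0, '2020-05-23'),
        (1, 0, '2021-06-12'),
        (2, 1, '2016-05-28'),
        (3, 2, '2020-07-02');
""")

# Filter on the aggregate, not on the alias: keep members whose latest
# payment is recent enough, plus members who have never paid (MAX is NULL).
rows = conn.execute("""
    SELECT m."Key", m."Name", MAX(p."Payment Date") AS "Last Payment"
    FROM "Members" m
    LEFT OUTER JOIN "Payments" p ON m."Key" = p."MemberKey"
    GROUP BY m."Key", m."Name"
    HAVING MAX(p."Payment Date") >= '2020-01-01'
        OR MAX(p."Payment Date") IS NULL
    ORDER BY m."Key"
""").fetchall()

for row in rows:
    print(row)
```

Bob is excluded because his latest payment is from 2016, while Ellen is kept because she has no payments at all.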

Postgres, Rails and selecting columns that are not in group clause

I have the following query in which I want to group by treatment_selections.treatment_id and select the treatments.name column to be called:
@search = Trial.joins(:quality_datum, treatment_selections: :treatment)
.select('DISTINCT ON (treatment_selections.treatment_id) treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield')
.where("EXTRACT(year from season_year) BETWEEN #{params[:start_year]} AND #{params[:end_year]}")
I get the dreaded error:
PG::GroupingError: ERROR: column "treatment_selections.treatment_id" must appear in the GROUP BY clause or be used in an aggregate function
So I switched to the following query:
@search = Trial.joins(:quality_datum, treatment_selections: :treatment)
.select('treatments.name, treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield')
.where("EXTRACT(year from season_year) BETWEEN #{params[:start_year]} AND #{params[:end_year]}")
.group('treatment_selections.treatment_id')
I know this won't work because treatments.name is not referenced in the group clause. But I figured the first query should have worked, as I'm not grouping by anything. I understand that columns wrapped in aggregate functions such as AVG and SUM don't need to appear in the group clause, but what about columns that aren't used in any aggregate function?
I have seen that nesting queries is a possible way of doing what I'm after, but I'm unsure of how best to implement this using the above query. Hoping someone could help me out here.
Log
SELECT treatment_selections.treatment_id, treatment.name, AVG(quality_data.yield) as yield FROM "trials" INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id" INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id" INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id" INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id" WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018) GROUP BY treatment_selections.treatment_id)

Selecting multiple columns (without aggregation) and using aggregate functions together won't be possible, unless you group by the selected columns - otherwise there is no way to determine how the average should be calculated (entire data set vs grouped by something). You could do this -
@search = Trial.joins(:quality_datum, treatment_selections: :treatment)
.select('treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield')
.where("EXTRACT(year from season_year) BETWEEN ? AND ?", params[:start_year], params[:end_year])
.group('treatment_selections.treatment_id, treatments.name')
Although this might not work well for your use case if one treatments.id can be associated with multiple treatments.name values.

I am not an expert on Rails, but let's analyze the logged query:
SELECT treatment_selections.treatment_id, treatment.name, AVG(quality_data.yield) as yield
FROM "trials"
INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id"
INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id"
INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id"
INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id"
WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018)
GROUP BY treatment_selections.treatment_id
Maybe you are relying on the DISTINCT ON clause to make this work without specifying both columns. But as you can see in the log, it is not being translated into SQL.
SELECT [missing DISTINCT ON(treatment_selections.treatment_id)] treatment_selections.treatment_id, treatment.name, AVG(quality_data.yield) as yield
FROM "trials"
INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id"
INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id"
INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id"
INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id"
WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018)
GROUP BY treatment_selections.treatment_id
But even if you managed to force Rails to implement DISTINCT ON, you might not get your intended result because DISTINCT ON should return only one row per treatment_id.
The standard SQL way is to list both columns in the GROUP BY clause.
If it is the case that treatment_id has a 1:1 relationship to treatment_name, then if you run the query without the AVG function (and without DISTINCT ON), the data would look similar to:
| treatment_id | name | yield |
------------------------------------------------------
| 1 | treatment 1 | 0.50 |
| 1 | treatment 1 | 0.45 |
| 2 | treatment 2 | 0.65 |
| 2 | treatment 2 | 0.66 |
| 3 | treatment 3 | 0.85 |
Now to use the average function you must aggregate by (both) treatment_id and treatment_name.
You must specify both because the database manager does not assume any relationship between the columns of the resulting data set. So, aggregating by both columns
SELECT treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield
FROM "trials"
INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id"
INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id"
INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id"
INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id"
WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018)
GROUP BY treatment_selections.treatment_id, treatments.name
will give you the following result:
| treatment_id | name | AVG(yield) |
------------------------------------------------------------
| 1 | treatment 1 | 0.475 |
| 2 | treatment 2 | 0.655 |
| 3 | treatment 3 | 0.85 |
To understand this better, if the resulting data in the first two columns was not related; for example:
| year | name | yield |
-----------------------------------------------
| 2000 | treatment 1 | 0.1 |
| 2000 | treatment 1 | 0.2 |
| 2000 | treatment 2 | 0.3 |
| 2000 | treatment 3 | 0.4 |
| 2001 | treatment 2 | 0.5 |
| 2001 | treatment 3 | 0.6 |
| 2002 | treatment 3 | 0.7 |
you must still group by year and name and, in this case, the average function would only be used when year and name are the same (note that it is not possible to do otherwise) resulting:
| year | name | AVG(yield) |
---------------------------------------------------
| 2000 | treatment 1 | 0.15 |
| 2000 | treatment 2 | 0.3 |
| 2000 | treatment 3 | 0.4 |
| 2001 | treatment 2 | 0.5 |
| 2001 | treatment 3 | 0.6 |
| 2002 | treatment 3 | 0.7 |
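The grouping rule can be tried out on the first sample table above. This is a minimal sketch using SQLite through Python's sqlite3 module; the `results` table is a hypothetical flattened stand-in for the joined Rails tables. Note that SQLite is more permissive than PostgreSQL about ungrouped columns, so the sketch illustrates the averages, not the PG::GroupingError itself.

```python
import sqlite3

# "results" is a hypothetical table holding the already-joined rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE results (treatment_id INTEGER, name TEXT, yield REAL);
    INSERT INTO results VALUES
        (1, 'treatment 1', 0.50),
        (1, 'treatment 1', 0.45),
        (2, 'treatment 2', 0.65),
        (2, 'treatment 2', 0.66),
        (3, 'treatment 3', 0.85);
""")

# Every non-aggregated column in the SELECT list also appears in GROUP BY.
rows = conn.execute("""
    SELECT treatment_id, name, AVG(yield) AS avg_yield
    FROM results
    GROUP BY treatment_id, name
    ORDER BY treatment_id
""").fetchall()

for treatment_id, name, avg_yield in rows:
    print(treatment_id, name, round(avg_yield, 3))
```

Each (treatment_id, name) pair yields one row with the average of its yields.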

What is the best way to attach a running total to selected row data?

I have a table that looks like this:
Created at | Amount | Register Name
--------------+---------+-----------------
01/01/2019... | -150.01 | Front
01/01/2019... | 38.10 | Back
What is the best way to attach an ascending-by-date running total to each record which applies only to the register name the record has? I can do this in Ruby, but doing it in the database will be much faster as it is a web application.
The application is a Rails application running Postgres 10, although the answer can be Rails-agnostic of course.

Use the aggregate sum() as a window function, e.g.:
with my_table (created_at, amount, register_name) as (
values
('2019-01-01', -150.01, 'Front'),
('2019-01-01', 38.10, 'Back'),
('2019-01-02', -150.01, 'Front'),
('2019-01-02', 38.10, 'Back')
)
select
created_at, amount, register_name,
sum(amount) over (partition by register_name order by created_at)
from my_table
order by created_at, register_name;
created_at | amount | register_name | sum
------------+---------+---------------+---------
2019-01-01 | 38.10 | Back | 38.10
2019-01-01 | -150.01 | Front | -150.01
2019-01-02 | 38.10 | Back | 76.20
2019-01-02 | -150.01 | Front | -300.02
(4 rows)

Query only records with max value within a group

Say you have the following users table on PostgreSQL:
id | group_id | name | age
---|----------|---------|----
1 | 1 | adam | 10
2 | 1 | ben | 11
3 | 1 | charlie | 12 <-
3 | 2 | donnie | 20
4 | 2 | ewan | 21 <-
5 | 3 | fred | 30 <-
How can I query all columns only from the oldest user per group_id (those marked with an arrow)?
I've tried with group by, but keep hitting "users.id" must appear in the GROUP BY clause.
(Note: I have to work the query into a Rails AR model scope.)

After some digging, it turns out you can use PostgreSQL's DISTINCT ON (col):
select distinct on (users.group_id) users.*
from users
order by users.group_id, users.age desc;
-- you might want to add extra column in ordering in case 2 users have the same age for same group_id
Translated in Rails, it would be:
User
.select('DISTINCT ON (users.group_id) users.*')
.order('users.group_id, users.age DESC')
Some doc about DISTINCT ON: https://www.postgresql.org/docs/9.3/sql-select.html#SQL-DISTINCT
Working example: https://www.db-fiddle.com/f/t4jeW4Sy91oxEfjMKYJpB1/0

You could use the ROW_NUMBER window function (or RANK, if ties are possible):
SELECT *
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY group_id ORDER BY age DESC) AS rn
FROM users) s
WHERE s.rn = 1;

You can use a subquery with the aggregated result in a join:
select m.*
from users m
inner join (
select group_id, max(age) max_age
from users
group by group_id
) AS t on (t.group_id = m.group_id and t.max_age = m.age)
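The join-on-aggregate approach is portable enough to try outside PostgreSQL. Below is a small sketch on SQLite via Python's sqlite3 module, using the rows from the question as given; it is an illustration under that assumption, not a Rails scope.

```python
import sqlite3

# SQLite stand-in for PostgreSQL; rows are copied from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, group_id INTEGER, name TEXT, age INTEGER);
    INSERT INTO users VALUES
        (1, 1, 'adam', 10),
        (2, 1, 'ben', 11),
        (3, 1, 'charlie', 12),
        (3, 2, 'donnie', 20),
        (4, 2, 'ewan', 21),
        (5, 3, 'fred', 30);
""")

# Compute the maximum age per group, then join back to fetch the full rows.
rows = conn.execute("""
    SELECT m.*
    FROM users m
    INNER JOIN (
        SELECT group_id, MAX(age) AS max_age
        FROM users
        GROUP BY group_id
    ) AS t ON t.group_id = m.group_id AND t.max_age = m.age
    ORDER BY m.group_id
""").fetchall()

for row in rows:
    print(row)
```

If two users in a group share the maximum age, this form returns both of them, unlike DISTINCT ON or ROW_NUMBER, which keep a single row per group.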

Left Joins that link to multiple rows only returning one

I'm trying to join two tables (call them table1 and table2) but only return 1 entry for each match. In table2, there is a column called 'current' that is either 'y', 'n', or 'null'. I have left joined the two tables and put a where clause to get me the 'y' and 'null' instances, those are easy. I need help to get the table1 rows that join only to 'n' rows to return one instance of a 'none' or 'null'. Here is an example
table1
ID
1
2
3
table2
ID | table1ID | current
1 | 1 | y
2 | 2 | null
3 | 3 | n
4 | 3 | n
5 | 3 | n
My current query joins on table1.ID=table2.table1ID and then has a where clause (where table2.current = 'y' or table2.current = 'null') but that doesn't work when there is no 'y' and the value isn't 'null'.
Can someone come up with a query that would join the table like I have but get me all 3 records from table1 like this?
Query Return
ID | table2ID | current
1 | 1 | y
2 | null | null
3 | 3 | null or none

First off, I'm assuming the "null" values are actually strings and not the DB value NULL.
If so, the query below should work (notice the inclusion of the where criteria INSIDE the ON sub-clause):
select
table1.ID as ID
,table2.ID as table2ID
,table2.current
from table1 left outer join table2
on (table2.table1ID = table1.ID and
(table2.current in ('y','null')))
If this does work, I would STRONGLY recommend changing the "null" string value to something else as it is entirely misleading... you or some other developer will lose time debugging this in the future.
If "null" actually refers to the null value, then change the above query to:
select
table1.ID as ID
,table2.ID as table2ID
,table2.current
from table1 left outer join table2
on (table2.table1ID = table1.ID and
(table2.current = 'y' or table2.current is null))
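The difference between filtering in the ON clause and filtering in WHERE can be seen on the sample rows. This sketch assumes the "null" in the question is a real NULL and uses SQLite through Python's sqlite3 module; the column name is double-quoted only as a precaution, because CURRENT is a keyword in some dialects.

```python
import sqlite3

# SQLite stand-in; data copied from the question, with a real NULL for row 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (ID INTEGER);
    CREATE TABLE table2 (ID INTEGER, table1ID INTEGER, "current" TEXT);
    INSERT INTO table1 VALUES (1), (2), (3);
    INSERT INTO table2 VALUES
        (1, 1, 'y'),
        (2, 2, NULL),
        (3, 3, 'n'),
        (4, 3, 'n'),
        (5, 3, 'n');
""")

# Filter inside ON: table1 rows without a qualifying match are kept, padded with NULLs.
on_rows = conn.execute("""
    SELECT t1.ID, t2.ID, t2."current"
    FROM table1 t1
    LEFT OUTER JOIN table2 t2
      ON t2.table1ID = t1.ID
     AND (t2."current" = 'y' OR t2."current" IS NULL)
    ORDER BY t1.ID
""").fetchall()

# Filter in WHERE: applied after the join, so table1 row 3 disappears entirely.
where_rows = conn.execute("""
    SELECT t1.ID, t2.ID, t2."current"
    FROM table1 t1
    LEFT OUTER JOIN table2 t2 ON t2.table1ID = t1.ID
    WHERE t2."current" = 'y' OR t2."current" IS NULL
    ORDER BY t1.ID
""").fetchall()

print(on_rows)
print(where_rows)
```

With the filter in ON, all three table1 rows come back and the 'n' rows collapse into a single NULL-padded row; with the filter in WHERE, only IDs 1 and 2 remain.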

you need to decide which of the three rows from table2 with table1id = 3 you want:
3 | 3 | n
4 | 3 | n
5 | 3 | n
what's the criterion?

select t1.id
, t2.id
, case when t2.count_current > 0 then
t2.count_current
else
null
end as current
from table1 t1
left outer join
(
select max(id) as id
, table1id
, sum(case when current = 'y' then 1 else 0 end) as count_current
from table2
group by table1id
) t2
on t1.id = t2.table1id
although, as justsomebody has pointed out, this may not work as you expect once you have multiple rows with 'y' in your table 2.
