I have two subqueries that I'd like to join solely on a date range: the date from the second table must fall between the open and close datetimes of the first table.
First table example:
| id_original | open_datetime | close_datetime |
|-------------|-------------------|-------------------|
| 1 |2019-01-01 10:00:02|2019-01-02 11:00:21|
| 2 |2019-01-01 10:05:52|2019-01-05 16:45:12|
| 3 |2019-01-03 00:00:43|2019-01-03 23:12:44|
Second table example:
| category | all other columns...| open_date |
|----------|---------------------|-------------------|
| A | ... |2019-01-01 11:00:00|
| B | ... |2019-01-02 19:10:10|
| C | ... |2019-01-03 08:23:45|
| D | ... |2019-01-04 18:10:53|
Desired output:
| id_original | category | all other columns...| open_date |
|-------------|----------|---------------------|-------------------|
| 1 | A | ... |2019-01-01 11:00:00|
| 2 | A | ... |2019-01-01 11:00:00|
| 2 | B | ... |2019-01-02 19:10:10|
| 2 | C | ... |2019-01-03 08:23:45|
| 2 | D | ... |2019-01-04 18:10:53|
| 3 | C | ... |2019-01-03 08:23:45|
This is my code:
SELECT *
FROM (
SELECT id, open_datetime, close_datetime
FROM table1
WHERE id IN (list_of_ids)
) t1
LEFT JOIN (
SELECT *
FROM table2
WHERE other_conditions
) t2 ON t2.open_date >= t1.open_datetime AND t2.open_date <= t1.close_datetime
I know that Hive SQL doesn't support inequality conditions in a JOIN. How should I approach this?
Note: The join I need is exclusively for dates, there is no equal key from t1 and t2 that I can use to join them.
Thanks!
Move the join condition to the WHERE clause. Because there are no other join conditions, the LEFT JOIN becomes a CROSS JOIN (a join without conditions is a cross join), and the rows are filtered in the WHERE clause afterwards. Be aware that a CROSS JOIN can cause serious performance problems if the inputs cannot be filtered beforehand or joined on some other key to avoid the full Cartesian product. If one of the tables is small enough to fit into memory, the CROSS JOIN will be executed as a map join, which also helps performance.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=512000000; --try to set it bigger and see if map-join works
--setting too big value may cause OOM exception
SELECT *
FROM (
SELECT id, open_datetime, close_datetime
FROM table1
WHERE id IN (list_of_ids)
) t1
CROSS JOIN
(
SELECT *
FROM table2
WHERE other_conditions
) t2
WHERE t2.open_date >= t1.open_datetime AND t2.open_date <= t1.close_datetime
--note: a CROSS JOIN never produces NULL-extended t2 rows, so t1 rows without any match are not returned (unlike LEFT JOIN)
;
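To sanity-check the cross-join-plus-filter shape against the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Hive. That substitution is an assumption made purely for illustration: SQLite would accept the range condition in an ON clause anyway, so this only shows the result, not Hive's behavior or performance. The `range_join` helper name is hypothetical, and the "all other columns" are omitted.

```python
import sqlite3


def range_join(conn):
    """Cross join both tables, then keep pairs whose open_date lies in the window."""
    return conn.execute(
        """
        SELECT t1.id_original, t2.category
        FROM table1 t1
        CROSS JOIN table2 t2
        WHERE t2.open_date >= t1.open_datetime
          AND t2.open_date <= t1.close_datetime
        ORDER BY t1.id_original, t2.open_date
        """
    ).fetchall()


# In-memory database loaded with the question's sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (id_original INTEGER, open_datetime TEXT, close_datetime TEXT);
    CREATE TABLE table2 (category TEXT, open_date TEXT);
    INSERT INTO table1 VALUES
        (1, '2019-01-01 10:00:02', '2019-01-02 11:00:21'),
        (2, '2019-01-01 10:05:52', '2019-01-05 16:45:12'),
        (3, '2019-01-03 00:00:43', '2019-01-03 23:12:44');
    INSERT INTO table2 VALUES
        ('A', '2019-01-01 11:00:00'),
        ('B', '2019-01-02 19:10:10'),
        ('C', '2019-01-03 08:23:45'),
        ('D', '2019-01-04 18:10:53');
    """
)

rows = range_join(conn)
for row in rows:
    print(row)
```

The timestamps are stored as text here; the comparison still works because the `YYYY-MM-DD HH:MM:SS` format sorts lexicographically in chronological order.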
Say you have the following users table on PostgreSQL:
id | group_id | name | age
---|----------|---------|----
1 | 1 | adam | 10
2 | 1 | ben | 11
3 | 1 | charlie | 12 <-
3 | 2 | donnie | 20
4 | 2 | ewan | 21 <-
5 | 3 | fred | 30 <-
How can I query all columns only from the oldest user per group_id (those marked with an arrow)?
I've tried with group by, but keep hitting "users.id" must appear in the GROUP BY clause.
(Note: I have to work the query into a Rails AR model scope.)
After some digging, it turns out you can use PostgreSQL's DISTINCT ON (col):
select distinct on (users.group_id) users.*
from users
order by users.group_id, users.age desc;
-- you might want to add an extra column to the ordering in case two users in the same group_id have the same age
Translated in Rails, it would be:
User
.select('DISTINCT ON (users.group_id) users.*')
.order('users.group_id, users.age DESC')
Documentation for DISTINCT ON: https://www.postgresql.org/docs/9.3/sql-select.html#SQL-DISTINCT
Working example: https://www.db-fiddle.com/f/t4jeW4Sy91oxEfjMKYJpB1/0
You could use the ROW_NUMBER window function (or RANK, if ties are possible and you want every tied row):
SELECT *
FROM (SELECT *,ROW_NUMBER() OVER(PARTITION BY group_id ORDER BY age DESC) AS rn
FROM users) s
WHERE s.rn = 1;
You can join against a subquery that holds the aggregated result:
select m.*
from users m
inner join (
select group_id, max(age) max_age
from users
group by group_id
) AS t on (t.group_id = m.group_id and t.max_age = m.age)
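The answers above behave differently when two users in a group share the maximum age. The following is a rough sketch of that difference, run through Python's sqlite3 module rather than PostgreSQL (an assumption for illustration; window functions need SQLite 3.25 or newer). The sample rows are adapted from the question: ids are made unique and ben's age is changed to 12 to create a tie, so treat the data as hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER, group_id INTEGER, name TEXT, age INTEGER);
    INSERT INTO users VALUES
        (1, 1, 'adam', 10), (2, 1, 'ben', 12), (3, 1, 'charlie', 12),
        (4, 2, 'donnie', 20), (5, 2, 'ewan', 21), (6, 3, 'fred', 30);
    """
)

# ROW_NUMBER with a tie-breaker (id) yields exactly one row per group.
one_per_group = conn.execute(
    """
    SELECT name
    FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY group_id
                                       ORDER BY age DESC, id) AS rn
          FROM users) s
    WHERE rn = 1
    ORDER BY group_id
    """
).fetchall()

# Joining on MAX(age) returns every user tied for the maximum.
with_ties = conn.execute(
    """
    SELECT m.name
    FROM users m
    JOIN (SELECT group_id, MAX(age) AS max_age
          FROM users
          GROUP BY group_id) t
      ON t.group_id = m.group_id AND t.max_age = m.age
    ORDER BY m.group_id, m.id
    """
).fetchall()

print(one_per_group)
print(with_ties)
```

Which behavior is right depends on whether the scope should return one representative per group or all users tied for oldest.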
I'm using PostgreSQL 9.4. I've got a resources table which has the following columns:
id
name
provider
description
category
Let's say none of these columns is required (save for the id). I want resources to have a completion level, meaning that resources with NULL values for each column will be at 0% completion level.
Now, each column has a percentage weight. Let's say:
name: 40%
provider: 30%
description: 20%
category: 10%
So if a resource has a provider and a category, its completion level is at 60%.
These weight percentages could change at any time, so a stored completion_level column that always holds the current completion level will not work out (there could be millions of resources). For example, at any moment, the percentage weight of description could decrease from 20% to 10% and category's could increase from 10% to 20%. Other columns might even be created later and have their own weights.
The final objective is to be able to order resources by their completion levels.
I'm not sure how to approach this. I'm currently using Rails so almost all interaction with the database has been through the ORM, which I believe is not going to be much help in this case.
The only query I've found that somewhat resembles a solution (and not really) is to do something like the following:
SELECT * from resources
ORDER BY CASE WHEN name IS NOT NULL AND
provider IS NOT NULL AND
description is NOT NULL AND
category IS NOT NULL THEN 100
WHEN name is NULL AND provider IS NOT NULL...
However, that way I would have to enumerate every possible combination, which is pretty bad.
Add a weights table as in this SQL Fiddle:
PostgreSQL 9.6 Schema Setup:
CREATE TABLE resource_weights
( id int primary key check(id = 1)
, name numeric
, provider numeric
, description numeric
, category numeric);
INSERT INTO resource_weights
(id, name, provider, description, category)
VALUES
(1, .4, .3, .2, .1);
CREATE TABLE resources
( id int
, name varchar(50)
, provider varchar(50)
, description varchar(50)
, category varchar(50));
INSERT INTO resources
(id, name, provider, description, category)
VALUES
(1, 'abc', 'abc', 'abc', 'abc'),
(2, NULL, 'abc', 'abc', 'abc'),
(3, NULL, NULL, 'abc', 'abc'),
(4, NULL, 'abc', NULL, NULL);
Then calculate your weights at runtime like this
Query 1:
select r.*
, case when r.name is null then 0 else w.name end
+ case when r.provider is null then 0 else w.provider end
+ case when r.description is null then 0 else w.description end
+ case when r.category is null then 0 else w.category end weight
from resources r
cross join resource_weights w
order by weight desc
Results:
| id | name | provider | description | category | weight |
|----|--------|----------|-------------|----------|--------|
| 1 | abc | abc | abc | abc | 1 |
| 2 | (null) | abc | abc | abc | 0.6 |
| 3 | (null) | (null) | abc | abc | 0.3 |
| 4 | (null) | abc | (null) | (null) | 0.3 |
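To illustrate why the single-row weights table helps when weights change, here is a small sketch that runs the same query before and after an update to resource_weights. It uses Python's sqlite3 module instead of PostgreSQL, and integer percentages (40, 30, 20, 10) instead of the fractions above so the sums stay exact; both choices are assumptions made for the example.

```python
import sqlite3

# Same shape as Query 1 above: sum the weights of the non-NULL columns.
QUERY = """
SELECT r.id,
       CASE WHEN r.name        IS NULL THEN 0 ELSE w.name        END
     + CASE WHEN r.provider    IS NULL THEN 0 ELSE w.provider    END
     + CASE WHEN r.description IS NULL THEN 0 ELSE w.description END
     + CASE WHEN r.category    IS NULL THEN 0 ELSE w.category    END AS weight
FROM resources r
CROSS JOIN resource_weights w
ORDER BY weight DESC, r.id
"""

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE resource_weights (
        id INTEGER PRIMARY KEY CHECK (id = 1),
        name INTEGER, provider INTEGER, description INTEGER, category INTEGER);
    INSERT INTO resource_weights VALUES (1, 40, 30, 20, 10);
    CREATE TABLE resources (
        id INTEGER, name TEXT, provider TEXT, description TEXT, category TEXT);
    INSERT INTO resources VALUES
        (1, 'abc', 'abc', 'abc', 'abc'),
        (2, NULL,  'abc', 'abc', 'abc'),
        (3, NULL,  NULL,  'abc', 'abc'),
        (4, NULL,  'abc', NULL,  NULL);
    """
)

before = conn.execute(QUERY).fetchall()

# Re-weight two columns; no row in resources has to be touched.
conn.execute("UPDATE resource_weights SET provider = 10, category = 30 WHERE id = 1")
after = conn.execute(QUERY).fetchall()

print(before)
print(after)
```

Rows 3 and 4 tie under the original weights and separate after the update, without any change to the resources table itself.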
SQL's ORDER BY can order things by pretty much any expression; in particular, you can order by a sum. CASE is also fairly versatile (if somewhat verbose) and an expression so you can say things like:
case when name is not null then 40 else 0 end
which is more or less equivalent to name.nil? ? 0 : 40 in Ruby.
Putting those together:
order by case when name is not null then 40 else 0 end
+ case when provider is not null then 30 else 0 end
+ case when description is not null then 20 else 0 end
+ case when category is not null then 10 else 0 end
Somewhat verbose but it'll do the right thing. Translating that into ActiveRecord is fairly easy:
query.order(Arel.sql(%q{
case when name is not null then 40 else 0 end
+ case when provider is not null then 30 else 0 end
+ case when description is not null then 20 else 0 end
+ case when category is not null then 10 else 0 end
}))
or in the other direction:
query.order(Arel.sql(%q{
case when name is not null then 40 else 0 end
+ case when provider is not null then 30 else 0 end
+ case when description is not null then 20 else 0 end
+ case when category is not null then 10 else 0 end
desc
}))
You'll need the Arel.sql call to avoid deprecation warnings in Rails 5.2+ as they don't want you to order(some_string) anymore, they just want you ordering by attributes unless you want to jump through some hoops to say that you really mean it.
Sum up weights like this:
SELECT * FROM resources
ORDER BY (CASE WHEN name IS NULL THEN 0 ELSE 40 END
+ CASE WHEN provider IS NULL THEN 0 ELSE 30 END
+ CASE WHEN description IS NULL THEN 0 ELSE 20 END
+ CASE WHEN category IS NULL THEN 0 ELSE 10 END) DESC;
This is how I would do it.
First: Weights
Since you say that the weights can change from time to time, you have to create a structure to handle the changes. It could be a simple table. For this solution, it will be called weights.
-- Table: weights
CREATE TABLE weights(id serial, table_name text, column_name text, weight numeric(5,2));
id | table_name | column_name | weight
---+------------+--------------+--------
1 | resources | name | 40.00
2 | resources | provider | 30.00
3 | resources | description | 20.00
4 | resources | category | 10.00
So, when you need to change category from 10 to 20 and/or description from 20 to 10, you update this table.
Second: completion_level
Since you say that you could have millions of rows, it is fine to keep a completion_level column in the resources table for efficiency.
Computing the completion_level in a query works, and you could wrap that query in a view. But when you need the data fast and simple and you have MILLIONS of rows, it is better to store the value in a column or in another table.
A plain view recomputes the data every time you query it. When the value is already stored in the table, reads are fast and nothing has to be recomputed; you just query the data.
But how can you handle a completion_level? TRIGGERS
You would have to create a trigger for resources table. So, whenever you update or insert data, it will create the completion level.
First you add the column to the resources table
ALTER TABLE resources ADD COLUMN completion_level numeric(5,2);
And then you create the trigger:
CREATE OR REPLACE FUNCTION update_completion_level() RETURNS trigger AS $$
BEGIN
NEW.completion_level := (
CASE WHEN NEW.name IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='name') END
+ CASE WHEN NEW.provider IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='provider') END
+ CASE WHEN NEW.description IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='description') END
+ CASE WHEN NEW.category IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='category') END
);
RETURN NEW;
END $$ LANGUAGE plpgsql;
CREATE TRIGGER resources_completion_level
BEFORE INSERT OR UPDATE
ON resources
FOR EACH ROW
EXECUTE PROCEDURE update_completion_level();
NOTE: table weights has a column called table_name; it's just in case you want to expand this functionality to other tables. In that case, you should update the trigger and add AND table_name='resources' in the query.
With this trigger, every time you update or insert you would have your completion_level ready so getting this data would be a simple query on resources table ;)
Third: What about old data and updates on weights?
Since the trigger only fires on inserts and updates, what about old data? And what if I change the weights of the columns?
For those cases you could use a function that recomputes completion_level for every row.
CREATE OR REPLACE FUNCTION update_resources_completion_level() RETURNS void AS $$
BEGIN
UPDATE resources set completion_level = (
CASE WHEN name IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='name') END
+ CASE WHEN provider IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='provider') END
+ CASE WHEN description IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='description') END
+ CASE WHEN category IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='category') END
);
END $$ LANGUAGE plpgsql;
So every time you change the weights, or need to backfill old data, you just run the function:
SELECT update_resources_completion_level();
Finally: What if I add columns?
Well, you would have to insert the new column into the weights table and update both functions (the trigger function and update_resources_completion_level()). Once everything is set, you run update_resources_completion_level() to set all completion levels according to the changes :D
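The trade-off of the stored-column approach is that the stored values go stale whenever the weights table changes, until the recompute step runs. Here is a minimal sketch of that behavior, using Python's sqlite3 module in place of PostgreSQL and a plain UPDATE statement in place of the plpgsql function (both are assumptions for illustration). The `RECOMPUTE` constant and `levels` helper are hypothetical names, the sample rows are invented, and the weights are stored as integers.

```python
import sqlite3

# Mirrors the body of update_resources_completion_level() above.
RECOMPUTE = """
UPDATE resources SET completion_level =
    CASE WHEN name IS NULL THEN 0 ELSE
        (SELECT weight FROM weights
         WHERE table_name = 'resources' AND column_name = 'name') END
  + CASE WHEN provider IS NULL THEN 0 ELSE
        (SELECT weight FROM weights
         WHERE table_name = 'resources' AND column_name = 'provider') END
  + CASE WHEN description IS NULL THEN 0 ELSE
        (SELECT weight FROM weights
         WHERE table_name = 'resources' AND column_name = 'description') END
  + CASE WHEN category IS NULL THEN 0 ELSE
        (SELECT weight FROM weights
         WHERE table_name = 'resources' AND column_name = 'category') END
"""


def levels(conn):
    """Return (id, completion_level) pairs as currently stored."""
    return conn.execute(
        "SELECT id, completion_level FROM resources ORDER BY id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE weights (table_name TEXT, column_name TEXT, weight INTEGER);
    INSERT INTO weights VALUES
        ('resources', 'name', 40), ('resources', 'provider', 30),
        ('resources', 'description', 20), ('resources', 'category', 10);
    CREATE TABLE resources (
        id INTEGER, name TEXT, provider TEXT, description TEXT, category TEXT,
        completion_level INTEGER);
    INSERT INTO resources (id, name, provider, description, category) VALUES
        (1, 'abc', 'abc', NULL, NULL),
        (2, NULL, NULL, 'abc', 'abc');
    """
)

conn.execute(RECOMPUTE)
initial = levels(conn)

# Change the weights: stored values stay stale until the recompute runs.
conn.execute("UPDATE weights SET weight = 10 WHERE column_name IN ('name', 'provider')")
conn.execute("UPDATE weights SET weight = 40 WHERE column_name IN ('description', 'category')")
stale = levels(conn)

conn.execute(RECOMPUTE)
fresh = levels(conn)

print(initial, stale, fresh)
```

With millions of rows the recompute is a full-table update, so how often the weights change is the main factor in choosing between this and computing the level at query time.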
What query should I execute in MySQL database to get a result containing partial sums of source table?
For example when I have table:
Id|Val
1 | 1
2 | 2
3 | 3
4 | 4
I'd like to get result like this:
Id|Val
1 | 1
2 | 3 # 1+2
3 | 6 # 1+2+3
4 | 10 # 1+2+3+4
Right now I get this result with a stored procedure containing a cursor and while loops. I'd like to find a better way to do this.
You can do this by joining the table on itself. The SUM will add up all rows up to this row:
select cur.id, sum(prev.val)
from TheTable cur
left join TheTable prev
on cur.id >= prev.id
group by cur.id
MySQL also allows the use of user variables to calculate this, which is more efficient but considered something of a hack:
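select
id
, @running_total := @running_total + val AS RunningTotal
from TheTable
cross join (select @running_total := 0) init -- initialize the variable; otherwise the sum stays NULL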
SELECT l.Id, SUM(r.Val) AS Val
FROM your_table AS l
INNER JOIN your_table AS r
ON l.Id >= r.Id -- compare on Id: comparing on Val only works while Val increases together with Id
GROUP BY l.Id
ORDER BY l.Id
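As a quick check of the self-join approach, the sketch below runs it on the question's sample data through Python's sqlite3 module (an assumption for illustration, standing in for MySQL). It also shows a windowed SUM for comparison, which is a different technique from the answers above: it requires window-function support, available in MySQL 8.0 and later and in SQLite 3.25 and later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE TheTable (id INTEGER PRIMARY KEY, val INTEGER);
    INSERT INTO TheTable VALUES (1, 1), (2, 2), (3, 3), (4, 4);
    """
)

# Self-join: each row is paired with itself and all earlier rows, then summed.
self_join = conn.execute(
    """
    SELECT cur.id, SUM(prev.val)
    FROM TheTable cur
    JOIN TheTable prev ON cur.id >= prev.id
    GROUP BY cur.id
    ORDER BY cur.id
    """
).fetchall()

# Window function: a running SUM ordered by id, with no join needed.
windowed = conn.execute(
    "SELECT id, SUM(val) OVER (ORDER BY id) FROM TheTable ORDER BY id"
).fetchall()

print(self_join)
print(windowed)
```

The self-join pairs every row with all earlier rows, so its work grows roughly quadratically with the table size; the windowed form avoids that where it is available.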
I'm trying to join two tables (call them table1 and table2) but return only one entry for each match. In table2 there is a column called 'current' that is either 'y', 'n', or 'null'. I have left-joined the two tables and added a where clause to get the 'y' and 'null' instances; those are easy. I need help with the table1 rows that join only to 'n' rows: each should return a single row with 'none' or 'null'. Here is an example:
table1
ID
1
2
3
table2
ID | table1ID | current
1 | 1 | y
2 | 2 | null
3 | 3 | n
4 | 3 | n
5 | 3 | n
My current query joins on table1.ID=table2.table1ID and then has a where clause (where table2.current = 'y' or table2.current = 'null') but that doesn't work when there is no 'y' and the value isn't 'null'.
Can someone come up with a query that would join the table like I have but get me all 3 records from table1 like this?
Query Return
ID | table2ID | current
1 | 1 | y
2 | null | null
3 | 3 | null or none
First off, I'm assuming the "null" values are actually strings and not the DB value NULL.
If so, the query below should work (notice the inclusion of the where criteria INSIDE the ON sub-clause):
select
table1.ID as ID
,table2.ID as table2ID
,table2.current
from table1 left outer join table2
on (table2.table1ID = table1.ID and
(table2.current in ('y','null')))
If this does work, I would STRONGLY recommend changing the "null" string value to something else as it is entirely misleading... you or some other developer will lose time debugging this in the future.
If "null" actually refers to the null value, then change the above query to:
select
table1.ID as ID
,table2.ID as table2ID
,table2.current
from table1 left outer join table2
on (table2.table1ID = table1.ID and
(table2.current = 'y' or table2.current is null))
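The reason the filter belongs in the ON clause can be seen by running both placements side by side. This is a minimal sketch using Python's sqlite3 module (an assumption for illustration; the original database is not stated), with the question's sample data and real NULL values. The column name is double-quoted because CURRENT is a keyword in some SQL dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (ID INTEGER);
    CREATE TABLE table2 (ID INTEGER, table1ID INTEGER, "current" TEXT);
    INSERT INTO table1 VALUES (1), (2), (3);
    INSERT INTO table2 VALUES
        (1, 1, 'y'), (2, 2, NULL), (3, 3, 'n'), (4, 3, 'n'), (5, 3, 'n');
    """
)

# Filter inside ON: unmatched table1 rows survive, padded with NULLs.
filter_in_on = conn.execute(
    """
    SELECT table1.ID, table2.ID, table2."current"
    FROM table1
    LEFT OUTER JOIN table2
      ON table2.table1ID = table1.ID
     AND (table2."current" = 'y' OR table2."current" IS NULL)
    ORDER BY table1.ID
    """
).fetchall()

# Filter in WHERE: it runs after the join and discards table1.ID = 3 entirely.
filter_in_where = conn.execute(
    """
    SELECT table1.ID, table2.ID, table2."current"
    FROM table1
    LEFT OUTER JOIN table2 ON table2.table1ID = table1.ID
    WHERE table2."current" = 'y' OR table2."current" IS NULL
    ORDER BY table1.ID
    """
).fetchall()

print(filter_in_on)
print(filter_in_where)
```

With the filter in ON, the 'n' rows simply fail to match, so table1 row 3 is kept once with NULLs on the table2 side.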
you need to decide which of the three rows from table2 with table1id = 3 you want:
3 | 3 | n
4 | 3 | n
5 | 3 | n
what's the criterion?
select t1.id
, t2.id as table2id
, case when t2.count_current > 0 then
t2.count_current
else
null
end as current
from table1 t1
left outer join
(
-- one row per table1id, so the join below returns at most one row per table1 row
select table1id
, max(id) as id
, sum(case when current = 'y' then 1 else 0 end) as count_current
from table2
group by table1id
) t2
on t1.id = t2.table1id
although, as justsomebody has pointed out, this may not work as you expect once you have multiple rows with 'y' in your table 2.