Complex Nested SUM / Sub-Select - ruby-on-rails

UPDATED with sample data etc.
I am a bit over my head on this complex query. Some background: this is a Rails app, and I have an expenditures model which has many expenditure_items, each of which has an amount column - these sum up to a total for the related expenditure.
A given expenditure can be an Order, which then can have multiple (or single, or no) related Invoice expenditures. I am looking for a single query that gives me all the orders with their invoice totals and identifies those whose invoices total more than a threshold (in my case 10%).
I get the idea from my searching that I need a sub-select here, but I can't sort it out. I apologize, as raw SQL is not my wheelhouse - normal Rails ActiveRecord calls meet 99% of my needs.
Sample Data:
=> SELECT * FROM expenditures WHERE id = 17;
id | category | parent_id
-----+----------------+----------
17 | purchase_order |
=> SELECT * FROM expenditures_items WHERE expenditure_id = 17;
id | amount
-----+-------------
1 | 1000.00
2 | 2000.00
I need to obtain the SUM of the expenditures_items amounts in my result - the original order total of $3,000.00.
Related Expenditures (invoices)
=> SELECT * FROM expenditures WHERE category = 'invoice' AND parent_id = 17;
id | category | parent_id
-----+----------------+----------
46 | invoice | 17
88 | invoice | 17
=> SELECT * FROM expenditures_items WHERE expenditure_id IN (46, 88) ;
id | amount | expenditure_id
-----+----------+---------------
23 | 500.00 | 46
24 | 1000.00 | 46
78 | 550.00 | 88
79 | 1100.00 | 88
Order 17 has two invoices (46 & 88) totalling $3,150.00 - this is the SUM of all the invoice expenditure_item amounts.
In the end I am looking for the SQL that gets me something like this:
=> SELECT * FROM expenditures WHERE category = 'purchase_order';
id | category | expenditure_total | invoice_total | percent
-----+----------------+-------------------+---------------+---------
17 | purchase_order | 3000.00 | 3150.00 | 5
45 | purchase_order | 4000.00 | 3000.00 | -25
75 | purchase_order | 7000.00 | 7000.00 | 0
99 | purchase_order | 10000.00 | 11100.00 | 11
percent is (invoice_total / expenditure_total - 1) * 100.
I also need to filter (perhaps with a HAVING clause) so that only the results with a percent > a threshold (say 10) remain.
From all my searching this seems to call for a subquery along with some joins, but I am lost at this point.
UPDATED Further
I had another look - this is close:
SELECT DISTINCT expenditures.*, SUM(invoice_items.amount) AS invoiced_total
FROM "expenditures"
JOIN expenditures AS invoices
  ON invoices.category = 'invoice'
 AND expenditures.id = CAST(invoices.ancestry AS INT)
JOIN expenditure_items ON expenditure_items.expenditure_id = expenditures.id
JOIN expenditure_items AS invoice_items ON invoice_items.expenditure_id = invoices.id
WHERE "expenditures"."category" IN ($1, $2)
GROUP BY expenditures.id
HAVING ( SUM(invoice_items.amount) / SUM(expenditure_items.amount) ) > 1.1
[["category", "work_order"], ["category", "purchase_order"]]
Here is the odd thing: the invoiced_total in the SELECT works - I get the proper amounts as per my example. The issue seems to be in my HAVING, where it only pulls the SUM of the first invoice.
UPDATE 3
Soooooo close:
SELECT DISTINCT
  expenditures.*,
  ( SELECT SUM(expenditure_items.amount)
    FROM expenditure_items
    WHERE expenditure_items.expenditure_id = expenditures.id ) AS order_total,
  ( SELECT SUM(expenditure_items.amount)
    FROM expenditure_items
    JOIN expenditures invoices ON expenditure_items.expenditure_id = invoices.id
      AND CAST(invoices.ancestry AS INT) = expenditures.id ) AS invoice_total
FROM "expenditures"
INNER JOIN "expenditure_items" ON "expenditure_items"."expenditure_id" = "expenditures"."id"
WHERE "expenditures"."category" IN ('work_order', 'purchase_order')
The only thing I can't get is eliminating the expenditures that either have no invoices or that are outside my 10% rule. The first was handled in my old solution with the original join - I can't seem to figure out how to sum on that join data.

step-by-step demo: db<>fiddle
I am sure there is a better solution, but this one should work:
WITH cte AS (
    SELECT
        e.id,
        e.category,
        COALESCE(parent_id, e.id) AS parent_id,
        ei.amount
    FROM expenditures e
    JOIN expenditures_items ei ON e.id = ei.expenditure_id
),
cte2 AS (
    SELECT
        id,
        SUM(amount) FILTER (WHERE category = 'purchase_order') AS expenditure_total,
        SUM(amount) FILTER (WHERE category = 'invoice') AS invoice_total
    FROM (
        SELECT
            parent_id AS id,
            category,
            SUM(amount) AS amount
        FROM cte
        GROUP BY parent_id, category
    ) s
    GROUP BY id
)
SELECT
    *,
    (invoice_total / expenditure_total - 1) * 100 AS percent
FROM cte2
The first CTE joins both tables. The COALESCE() function mirrors the id as parent_id if the record has none (i.e. when category = 'purchase_order'). This makes it possible to do one single GROUP BY on this id and the category.
This is done within the second CTE (the innermost subquery). [Btw: I chose the CTE variant because I find it much more readable. You could of course do all steps as subqueries instead.] This grouping sums up the different categories for each (parent_)id.
The outer query of the second CTE does a pivot: it shifts the per-category records into your expected result with the help of a GROUP BY and the FILTER clause (have a look at this step in the fiddle to understand it). Don't worry about the SUM() function here; because of the GROUP BY an aggregate function is required, but it effectively does nothing since the grouping has already happened.
The last step calculates the percent value from the pivoted table.
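As a sanity check, here is the same approach run against the question's sample data using Python's bundled SQLite. The FILTER clauses are swapped for CASE-based conditional aggregation (which is portable to older engines), and the two grouping steps are collapsed into one - summing per (parent_id, category) first and then again per parent_id yields the same totals:

```python
import sqlite3

# Load the question's sample data into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE expenditures (id INT, category TEXT, parent_id INT);
CREATE TABLE expenditures_items (id INT, amount REAL, expenditure_id INT);
INSERT INTO expenditures VALUES
  (17, 'purchase_order', NULL), (46, 'invoice', 17), (88, 'invoice', 17);
INSERT INTO expenditures_items VALUES
  (1, 1000.00, 17), (2, 2000.00, 17),
  (23, 500.00, 46), (24, 1000.00, 46),
  (78, 550.00, 88), (79, 1100.00, 88);
""")

row = conn.execute("""
WITH cte AS (
  -- mirror each order's own id as parent_id so orders and invoices group together
  SELECT COALESCE(e.parent_id, e.id) AS parent_id, e.category, ei.amount
  FROM expenditures e
  JOIN expenditures_items ei ON e.id = ei.expenditure_id
),
cte2 AS (
  -- pivot: one row per order, order total and invoice total side by side
  SELECT parent_id AS id,
         SUM(CASE WHEN category = 'purchase_order' THEN amount END) AS expenditure_total,
         SUM(CASE WHEN category = 'invoice' THEN amount END) AS invoice_total
  FROM cte
  GROUP BY parent_id
)
SELECT id, expenditure_total, invoice_total,
       ROUND((invoice_total / expenditure_total - 1) * 100, 2) AS percent
FROM cte2
""").fetchone()
print(row)  # (17, 3000.0, 3150.0, 5.0)
```

From here, the asker's threshold filter is just a `WHERE percent > 10` on the final SELECT.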

Related

How to join exclusively by date range in Hive SQL?

I have two subqueries that I'd like to join only by the date range between the open and close dates from the first table.
First table example:
| id_original | open_datetime | close_datetime |
|-------------|-------------------|-------------------|
| 1 |2019-01-01 10:00:02|2019-01-02 11:00:21|
| 2 |2019-01-01 10:05:52|2019-01-05 16:45:12|
| 3 |2019-01-03 00:00:43|2019-01-03 23:12:44|
Second table example:
| category | all other columns...| open_date |
|----------|---------------------|-------------------|
| A | ... |2019-01-01 11:00:00|
| B | ... |2019-01-02 19:10:10|
| C | ... |2019-01-03 08:23:45|
| D | ... |2019-01-04 18:10:53|
Desired output:
| id_original | category | all other columns...| open_date |
|-------------|----------|---------------------|-------------------|
| 1 | A | ... |2019-01-01 11:00:00|
| 2 | A | ... |2019-01-01 11:00:00|
| 2 | B | ... |2019-01-02 19:10:10|
| 2 | C | ... |2019-01-03 08:23:45|
| 2 | D | ... |2019-01-04 18:10:53|
| 3 | C | ... |2019-01-03 08:23:45|
This is my code:
SELECT *
FROM (
SELECT id, open_datetime, close_datetime
FROM table1
WHERE id IN (list_of_ids)
) t1
LEFT JOIN (
SELECT *
FROM table2
WHERE other_conditions
) t2 ON t2.open_date >= t1.open_datetime AND t2.open_date <= t1.close_datetime
I know that Hive SQL doesn't support non-equality conditions in a JOIN. But how should I approach this?
Note: the join I need is exclusively on dates; there is no equality key between t1 and t2 that I can use to join them.
Thanks!
Move the join condition to the WHERE clause. The LEFT JOIN then becomes a CROSS join, because there are no other join conditions, and a join without conditions is a CROSS join. After the cross join, filter rows in the WHERE clause. Note that a CROSS join can cause serious performance issues if it is not possible to filter rows or join on some other key to avoid the cross product. If one of the tables is small enough to fit into memory, the CROSS join will be executed as a map-join, which also helps performance.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=512000000; --try to set it bigger and see if map-join works
--setting too big value may cause OOM exception
SELECT *
FROM (
SELECT id, open_datetime, close_datetime
FROM table1
WHERE id IN (list_of_ids)
) t1
CROSS JOIN
(
SELECT *
FROM table2
WHERE other_conditions
) t2
WHERE (t2.open_date >= t1.open_datetime AND t2.open_date <= t1.close_datetime)
OR t2.category is NULL --to allow absence of t2 like in LEFT join
;
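The Hive settings above only apply inside Hive, but the cross-join-plus-filter logic itself can be checked anywhere. Here is a minimal sketch against the question's sample data using Python's bundled SQLite (just the inner-join case, without the NULL-padding OR clause); ISO-format timestamp strings compare correctly as text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id_original INT, open_datetime TEXT, close_datetime TEXT);
CREATE TABLE table2 (category TEXT, open_date TEXT);
INSERT INTO table1 VALUES
  (1, '2019-01-01 10:00:02', '2019-01-02 11:00:21'),
  (2, '2019-01-01 10:05:52', '2019-01-05 16:45:12'),
  (3, '2019-01-03 00:00:43', '2019-01-03 23:12:44');
INSERT INTO table2 VALUES
  ('A', '2019-01-01 11:00:00'),
  ('B', '2019-01-02 19:10:10'),
  ('C', '2019-01-03 08:23:45'),
  ('D', '2019-01-04 18:10:53');
""")

# Cross join, then filter to rows whose open_date falls inside the range.
rows = conn.execute("""
SELECT t1.id_original, t2.category
FROM table1 t1
CROSS JOIN table2 t2
WHERE t2.open_date BETWEEN t1.open_datetime AND t1.close_datetime
ORDER BY t1.id_original, t2.category
""").fetchall()
print(rows)  # [(1, 'A'), (2, 'A'), (2, 'B'), (2, 'C'), (2, 'D'), (3, 'C')]
```

This matches the desired output in the question; ids with no match in table2 simply drop out, which is where the extra OR clause in the answer comes in if LEFT-join semantics are needed.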

Query only records with max value within a group

Say you have the following users table on PostgreSQL:
id | group_id | name | age
---|----------|---------|----
1 | 1 | adam | 10
2 | 1 | ben | 11
3 | 1 | charlie | 12 <-
4 | 2 | donnie | 20
5 | 2 | ewan | 21 <-
6 | 3 | fred | 30 <-
How can I query all columns only from the oldest user per group_id (those marked with an arrow)?
I've tried with group by, but keep hitting "users.id" must appear in the GROUP BY clause.
(Note: I have to work the query into a Rails AR model scope.)
After some digging: you can use PostgreSQL's DISTINCT ON (col):
select distinct on (users.group_id) users.*
from users
order by users.group_id, users.age desc;
-- you might want to add extra column in ordering in case 2 users have the same age for same group_id
Translated in Rails, it would be:
User
.select('DISTINCT ON (users.group_id) users.*')
.order('users.group_id, users.age DESC')
Some doc about DISTINCT ON: https://www.postgresql.org/docs/9.3/sql-select.html#SQL-DISTINCT
Working example: https://www.db-fiddle.com/f/t4jeW4Sy91oxEfjMKYJpB1/0
You could use ROW_NUMBER/RANK(if ties are possible) windowed functions:
SELECT *
FROM (SELECT *,ROW_NUMBER() OVER(PARTITION BY group_id ORDER BY age DESC) AS rn
FROM tab) s
WHERE s.rn = 1;
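DISTINCT ON is PostgreSQL-only, but the ROW_NUMBER() variant runs on any engine with window functions. A quick check against the sample data using Python's bundled SQLite (window functions need SQLite 3.25+, which ships with recent Python builds):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INT, group_id INT, name TEXT, age INT);
INSERT INTO users VALUES
  (1, 1, 'adam', 10), (2, 1, 'ben', 11), (3, 1, 'charlie', 12),
  (4, 2, 'donnie', 20), (5, 2, 'ewan', 21), (6, 3, 'fred', 30);
""")

# Number rows within each group by descending age, then keep rank 1.
rows = conn.execute("""
SELECT id, group_id, name, age
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY group_id ORDER BY age DESC) AS rn
      FROM users) s
WHERE s.rn = 1
ORDER BY group_id
""").fetchall()
print(rows)  # [(3, 1, 'charlie', 12), (5, 2, 'ewan', 21), (6, 3, 'fred', 30)]
```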
You can use a subquery with an aggregated result in a join:
select m.*
from users m
inner join (
select group_id, max(age) max_age
from users
group by group_id
) AS t on (t.group_id = m.group_id and t.max_age = m.age)

PostgreSQL: Order by multiple column weights

I'm using PostgreSQL 9.4. I've got a resources table which has the following columns:
id
name
provider
description
category
Let's say none of these columns is required (save for the id). I want resources to have a completion level, meaning that resources with NULL values for each column will be at 0% completion level.
Now, each column has a percentage weight. Let's say:
name: 40%
provider: 30%
description: 20%
category: 10%
So if a resource has a provider and a category, its completion level is at 60%.
These weight percentages could change at any time, so having a completion_level column which always has the value of the completion level will not work out (there could be million of resources). For example, at any moment, the percentage weight of description could decrease from 20% to 10% and category's from 10% to 20%. Maybe even other columns could be created and have their own weight.
The final objective is to be able to order resources by their completion levels.
I'm not sure how to approach this. I'm currently using Rails so almost all interaction with the database has been through the ORM, which I believe is not going to be much help in this case.
The only query I've found that somewhat resembles a solution (and not really) is to do something like the following:
SELECT * FROM resources
ORDER BY CASE WHEN name IS NOT NULL AND
              provider IS NOT NULL AND
              description IS NOT NULL AND
              category IS NOT NULL THEN 100
         WHEN name IS NULL AND provider IS NOT NULL...
However, that way I must enumerate every possible combination, and that's pretty bad.
Add a weights table as in this SQL Fiddle:
PostgreSQL 9.6 Schema Setup:
CREATE TABLE resource_weights
( id int primary key check(id = 1)
, name numeric
, provider numeric
, description numeric
, category numeric);
INSERT INTO resource_weights
(id, name, provider, description, category)
VALUES
(1, .4, .3, .2, .1);
CREATE TABLE resources
( id int
, name varchar(50)
, provider varchar(50)
, description varchar(50)
, category varchar(50));
INSERT INTO resources
(id, name, provider, description, category)
VALUES
(1, 'abc', 'abc', 'abc', 'abc'),
(2, NULL, 'abc', 'abc', 'abc'),
(3, NULL, NULL, 'abc', 'abc'),
(4, NULL, 'abc', NULL, NULL);
Then calculate your weights at runtime like this
Query 1:
select r.*
, case when r.name is null then 0 else w.name end
+ case when r.provider is null then 0 else w.provider end
+ case when r.description is null then 0 else w.description end
+ case when r.category is null then 0 else w.category end weight
from resources r
cross join resource_weights w
order by weight desc
Results:
| id | name | provider | description | category | weight |
|----|--------|----------|-------------|----------|--------|
| 1 | abc | abc | abc | abc | 1 |
| 2 | (null) | abc | abc | abc | 0.6 |
| 3 | (null) | (null) | abc | abc | 0.3 |
| 4 | (null) | abc | (null) | (null) | 0.3 |
SQL's ORDER BY can order things by pretty much any expression; in particular, you can order by a sum. CASE is also fairly versatile (if somewhat verbose), and since it's an expression you can say things like:
case when name is not null then 40 else 0 end
which is more or less equivalent to name.nil? ? 0 : 40 in Ruby.
Putting those together:
order by case when name is not null then 40 else 0 end
+ case when provider is not null then 30 else 0 end
+ case when description is not null then 20 else 0 end
+ case when category is not null then 10 else 0 end
Somewhat verbose but it'll do the right thing. Translating that into ActiveRecord is fairly easy:
query.order(Arel.sql(%q{
case when name is not null then 40 else 0 end
+ case when provider is not null then 30 else 0 end
+ case when description is not null then 20 else 0 end
+ case when category is not null then 10 else 0 end
}))
or in the other direction:
query.order(Arel.sql(%q{
case when name is not null then 40 else 0 end
+ case when provider is not null then 30 else 0 end
+ case when description is not null then 20 else 0 end
+ case when category is not null then 10 else 0 end
desc
}))
You'll need the Arel.sql call to avoid deprecation warnings in Rails 5.2+ as they don't want you to order(some_string) anymore, they just want you ordering by attributes unless you want to jump through some hoops to say that you really mean it.
Sum up weights like this:
SELECT * FROM resources
ORDER BY (CASE WHEN name IS NULL THEN 0 ELSE 40 END
+ CASE WHEN provider IS NULL THEN 0 ELSE 30 END
+ CASE WHEN description IS NULL THEN 0 ELSE 20 END
+ CASE WHEN category IS NULL THEN 0 ELSE 10 END) DESC;
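A quick check of the weighted CASE sum against the sample rows from the fiddle above, using Python's bundled SQLite (with id as a tie-breaker so the order is deterministic - ids 3 and 4 both score 30):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resources (id INT, name TEXT, provider TEXT,
                        description TEXT, category TEXT);
INSERT INTO resources VALUES
  (1, 'abc', 'abc', 'abc', 'abc'),
  (2, NULL, 'abc', 'abc', 'abc'),
  (3, NULL, NULL, 'abc', 'abc'),
  (4, NULL, 'abc', NULL, NULL);
""")

# Each CASE contributes its column's weight when the column is filled.
rows = conn.execute("""
SELECT id,
       CASE WHEN name IS NULL THEN 0 ELSE 40 END
     + CASE WHEN provider IS NULL THEN 0 ELSE 30 END
     + CASE WHEN description IS NULL THEN 0 ELSE 20 END
     + CASE WHEN category IS NULL THEN 0 ELSE 10 END AS completion
FROM resources
ORDER BY completion DESC, id
""").fetchall()
print(rows)  # [(1, 100), (2, 60), (3, 30), (4, 30)]
```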
This is how I would do it.
First: Weights
Since you say that the weights can change from time to time, you have to create a structure to handle the changes. It could be a simple table. For this solution, it will be called weights.
-- Table: weights
CREATE TABLE weights(id serial, table_name text, column_name text, weight numeric(5,2));
id | table_name | column_name | weight
---+------------+--------------+--------
1 | resources | name | 40.00
2 | resources | provider | 30.00
3 | resources | description | 20.00
4 | resources | category | 10.00
So, when you need to change category from 10 to 20 and/or description from 20 to 10, you update this structure.
Second: completion_level
Since you say that you could have millions of rows, it is OK to have a completion_level column in the resources table, for efficiency.
A query that computes the completion_level works, and you could put it in a view. But when you need the data fast and simple and you have MILLIONS of rows, it is better to store the data in a column or in another table.
With a view, every time you run it, it recomputes the data. When it's already in the table, it's fast and you don't have to recompute anything; you just query the data.
But how can you handle a completion_level? TRIGGERS
You would have to create a trigger for resources table. So, whenever you update or insert data, it will create the completion level.
First you add the column to the resources table
ALTER TABLE resources ADD COLUMN completion_level numeric(5,2);
And then you create the trigger:
CREATE OR REPLACE FUNCTION update_completion_level() RETURNS trigger AS $$
BEGIN
NEW.completion_level := (
CASE WHEN NEW.name IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='name') END
+ CASE WHEN NEW.provider IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='provider') END
+ CASE WHEN NEW.description IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='description') END
+ CASE WHEN NEW.category IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='category') END
);
RETURN NEW;
END $$ LANGUAGE plpgsql;
CREATE TRIGGER resources_completion_level
BEFORE INSERT OR UPDATE
ON resources
FOR EACH ROW
EXECUTE PROCEDURE update_completion_level();
NOTE: table weights has a column called table_name; it's just in case you want to expand this functionality to other tables. In that case, you should update the trigger and add AND table_name='resources' in the query.
With this trigger, every time you update or insert you would have your completion_level ready so getting this data would be a simple query on resources table ;)
Third: What about old data and updates on weights?
Since the trigger only works for update and inserts, what about old data? or what if I change the weights of the columns?
Well, for those cases you could use a function to recreate the completion_level for every row.
CREATE OR REPLACE FUNCTION update_resources_completion_level() RETURNS void AS $$
BEGIN
UPDATE resources set completion_level = (
CASE WHEN name IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='name') END
+ CASE WHEN provider IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='provider') END
+ CASE WHEN description IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='description') END
+ CASE WHEN category IS NULL THEN 0
ELSE (SELECT weight FROM weights WHERE column_name='category') END
);
END $$ LANGUAGE plpgsql;
So every time you update the weights or need to update the OLD data, you just run the function
SELECT update_resources_completion_level();
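The recompute inside update_resources_completion_level() is just an UPDATE with correlated subqueries against the weights table, so the logic can be sketched engine-agnostically. A minimal version against a trimmed-down schema (table_name column omitted) in Python's bundled SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE weights (column_name TEXT, weight REAL);
INSERT INTO weights VALUES ('name', 40), ('provider', 30),
                           ('description', 20), ('category', 10);
CREATE TABLE resources (id INT, name TEXT, provider TEXT,
                        description TEXT, category TEXT,
                        completion_level REAL);
INSERT INTO resources (id, name, provider, description, category) VALUES
  (1, 'abc', 'abc', 'abc', 'abc'),
  (2, NULL, 'abc', 'abc', 'abc'),
  (3, NULL, NULL, 'abc', 'abc');
""")

# Recompute every row's completion_level from the current weights table,
# as the plpgsql function above does.
conn.execute("""
UPDATE resources SET completion_level =
    CASE WHEN name IS NULL THEN 0
         ELSE (SELECT weight FROM weights WHERE column_name = 'name') END
  + CASE WHEN provider IS NULL THEN 0
         ELSE (SELECT weight FROM weights WHERE column_name = 'provider') END
  + CASE WHEN description IS NULL THEN 0
         ELSE (SELECT weight FROM weights WHERE column_name = 'description') END
  + CASE WHEN category IS NULL THEN 0
         ELSE (SELECT weight FROM weights WHERE column_name = 'category') END
""")
rows = conn.execute(
    "SELECT id, completion_level FROM resources ORDER BY id").fetchall()
print(rows)  # [(1, 100.0), (2, 60.0), (3, 30.0)]
```

Changing a row in weights and re-running the same UPDATE refreshes all completion levels, which is exactly the maintenance story the answer describes.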
Finally: What if I add columns?
Well, you would have to insert the new column in the weights table and update the functions (the trigger and update_resources_completion_level()). Once everything is set, you run update_resources_completion_level() to set all weights according to the changes :D

MySQL query that computes partial sums

What query should I execute in MySQL database to get a result containing partial sums of source table?
For example when I have table:
Id|Val
1 | 1
2 | 2
3 | 3
4 | 4
I'd like to get result like this:
Id|Val
1 | 1
2 | 3 # 1+2
3 | 6 # 1+2+3
4 | 10 # 1+2+3+4
Right now I get this result with a stored procedure containing a cursor and while loops. I'd like to find a better way to do this.
You can do this by joining the table on itself. The SUM will add up all rows up to this row:
select cur.id, sum(prev.val)
from TheTable cur
left join TheTable prev
on cur.id >= prev.id
group by cur.id
MySQL also allows the use of user variables to calculate this, which is more efficient but considered something of a hack:
select
id
, @running_total := @running_total + val AS RunningTotal
from TheTable
cross join (select @running_total := 0) AS init
order by id
SELECT l.Id, SUM(r.Val) AS Val
FROM your_table AS l
INNER JOIN your_table AS r
ON l.Id >= r.Id
GROUP BY l.Id
ORDER BY l.Id
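A minimal check of the self-join pattern using Python's bundled SQLite (table and column names match the example, not any real schema). Note the self-join is O(n²) in rows; on engines with window functions, SUM(val) OVER (ORDER BY id) computes the same running total in one pass:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TheTable (id INT, val INT);
INSERT INTO TheTable VALUES (1, 1), (2, 2), (3, 3), (4, 4);
""")

# For each row, sum the val of every row up to and including it.
rows = conn.execute("""
SELECT cur.id, SUM(prev.val) AS running_total
FROM TheTable cur
JOIN TheTable prev ON cur.id >= prev.id
GROUP BY cur.id
ORDER BY cur.id
""").fetchall()
print(rows)  # [(1, 1), (2, 3), (3, 6), (4, 10)]
```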

Left Joins that link to multiple rows only returning one

I'm trying to join two tables (call them table1 and table2) but return only one entry for each match. In table2 there is a column called 'current' that is either 'y', 'n', or 'null'. I have left joined the two tables and added a WHERE clause to get the 'y' and 'null' instances; those are easy. I need help getting the rows that join only to 'n' rows to return a single instance of 'none' or 'null'. Here is an example:
table1
ID
1
2
3
table2
ID | table1ID | current
1 | 1 | y
2 | 2 | null
3 | 3 | n
4 | 3 | n
5 | 3 | n
My current query joins on table1.ID = table2.table1ID and then has a WHERE clause (WHERE table2.current = 'y' OR table2.current = 'null'), but that doesn't work when there is no 'y' and the value isn't 'null'.
Can someone come up with a query that would join the table like I have but get me all 3 records from table1 like this?
Query Return
ID | table2ID | current
1 | 1 | y
2 | null | null
3 | 3 | null or none
First off, I'm assuming the "null" values are actually strings and not the DB value NULL.
If so, the query below should work (notice the inclusion of the WHERE criteria INSIDE the ON clause):
select
  table1.ID as ID
, table2.ID as table2ID
, table2.current
from table1 left outer join table2
  on (table2.table1ID = table1.ID and
      table2.current in ('y','null'))
If this does work, I would STRONGLY recommend changing the "null" string value to something else as it is entirely misleading... you or some other developer will lose time debugging this in the future.
If "null" actually refers to the null value, then change the above query to:
select
table1.ID as ID
,table2.ID as table2ID
,table2.current
from table1 left outer join table2
on (table2.table1ID = table1.ID and
(table2.current = 'y' or table2.current is null))
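Here is the second variant (real NULLs) checked against the sample data in Python's bundled SQLite; the 'current' column is quoted since CURRENT can be a keyword. Note that for ID 2 the matching table2 row is kept (table2ID = 2), which differs slightly from the sketch in the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (ID INT);
CREATE TABLE table2 (ID INT, table1ID INT, "current" TEXT);
INSERT INTO table1 VALUES (1), (2), (3);
INSERT INTO table2 VALUES
  (1, 1, 'y'), (2, 2, NULL), (3, 3, 'n'), (4, 3, 'n'), (5, 3, 'n');
""")

# The filter lives in the ON clause, so unmatched table1 rows survive
# the LEFT JOIN with NULLs instead of being dropped by a WHERE.
rows = conn.execute("""
SELECT table1.ID, table2.ID AS table2ID, table2."current"
FROM table1
LEFT OUTER JOIN table2
  ON table2.table1ID = table1.ID
 AND (table2."current" = 'y' OR table2."current" IS NULL)
ORDER BY table1.ID
""").fetchall()
print(rows)  # [(1, 1, 'y'), (2, 2, None), (3, None, None)]
```

ID 3, whose only table2 rows are all 'n', comes back exactly once with NULLs, which is the behavior the asker wanted.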
You need to decide which of the three rows from table2 with table1ID = 3 you want:
3 | 3 | n
4 | 3 | n
5 | 3 | n
what's the criterion?
select t1.id
     , t2.id
     , case when t2.count_current > 0 then
         t2.count_current
       else
         null
       end as current
from table1 t1
left outer join
(
  select table1id
       , max(id) as id
       , sum(case when current = 'y' then 1 else 0 end) as count_current
  from table2
  group by table1id
) t2
on t1.id = t2.table1id
Although, as justsomebody has pointed out, this may not work as you expect once you have multiple rows with 'y' in table2.
