2 joins on same table while summing amounts - ruby-on-rails

I've got 2 tables that I need to efficiently pull data out of.
TableA
id
full_name
TableB
table_a_id_1
table_a_id_2
amount1
amount2
TableB references two different records on TableA (they will never be the same)
I am trying to write 1 query (or the most efficient query, trying to avoid n+2 query) to get a list of all records in TableA with the totaled sums for corresponding TableB
It would look something like this:
[{id: 1, full_name: 'bob', 1_sum1: 50.0, 1_sum2: 30.0, 2_sum1: 1.0, 2_sum2: 25.0}]
where
id and full_name is from TableA
1_sum1, 1_sum2, are both summed columns from TableB where table_a_id_1 = TableA.id
2_sum1, 2_sum2 are both summed columns from TableB where table_a_id_2 = TableA.id
I really hope this isn't confusing. I've got a query like this:
results = TableA.
joins('LEFT JOIN table_b t1_stats ON t1_stats.t1_aff_id = table_a.id').
joins('LEFT JOIN table_b t2_stats ON t2_stats.t2_aff_id = table_a.id').
select("table_a.*,
sum(t1_stats.amount1) AS 1_sum1,
sum(t2_stats.amount1) AS 2_sum1,
sum(t1_stats.amount2) AS 1_sum2,
sum(t2_stats.amount2) AS 2_sum2").
group('table_a.id')
I'm not getting results that are correct on the summed totals. I think its because it should be only summing records for 1_sum1 on records from table_b where table_a_id_1 = table_a.id and instead I think it might be including all records, then summing them.
Do I need to do a sub select or something instead? This is not my strong point here, so any help on getting this query sorted out would be great!
Thanks

Try:
results = TableA.
select("table_a.*,
(SELECT sum(table_b.amount1) FROM table_b WHERE table_b.t1_aff_id = table_a.id) AS 1_sum1,
(SELECT sum(table_b.amount2) FROM table_b WHERE table_b.t1_aff_id = table_a.id) AS 1_sum2,
(SELECT sum(table_b.amount1) FROM table_b WHERE table_b.t2_aff_id = table_a.id) AS 2_sum1,
(SELECT sum(table_b.amount2) FROM table_b WHERE table_b.t2_aff_id = table_a.id) AS 2_sum2")

Related

LEFT JOIN using _PARTITIONDATE

I'm currently using StandardSQL in BigQuery, I tried to join two sets of table one of which is a pseudo-column table partitioned by day.
I tried to use this query below:
SELECT
DISTINCT DATE(create_time) AS date,
user_id,
city_name,
transaction_id,
price
FROM
table_1 a
LEFT JOIN (SELECT user_id, city_name FROM table_2) b
ON (a.user_id = b.user_id AND DATE(create_time) = _PARTITIONDATE)
I've tried this kind of JOIN (using _PARTITIONDATE) and worked out, but for this particular query I got an error message:
Unrecognized name: _PARTITIONDATE
Can anyone tell me why this happened, and how could I solve this? Thanks in advance.
The issue is that you are not selecting the _PARTITIONDATE field from table_2 when joining it so it can't recognize it:
SELECT user_id, city_name FROM table_2
In order to solve it you can add it as follows:
SELECT
DISTINCT DATE(create_time) AS date,
user_id,
city_name,
transaction_id,
price
FROM
table_1 a
LEFT JOIN (SELECT _PARTITIONDATE AS pd, user_id, city_name FROM table_2) b
ON (a.user_id = b.user_id AND DATE(create_time) = pd)
Note that you'll need an alias such as pd as it's a pseudocolumn
Probably it was working in the past if you were joining two tables directly such as in (you don't get selectivity benefits in that case):
FROM
table_1 a
LEFT JOIN table_2 b
ON (a.user_id = b.user_id AND DATE(create_time) = _PARTITIONDATE)

How can I join 2 tables?

I would like to join tables . Could you please help?
Select Number, OwnerId from DNIS.numbers
select ID,Name from DNIS.owners
Thank you.
Normally, SQL servers allow you to join tables from different databases as long as the former all belong to them. Here is an example showing you how to do this (all you have to do is to explicitly write the database names associated to each table in the query):
SELECT N.Number, N.OwnerId, O.ID, O.Name
FROM DB1.[dbo].DNIS numbers N
JOIN DB2.[dbo].DNIS owners O ON O.ID = N.OwnerId
You can also use the following syntax:
SELECT N.Number, N.OwnerId, O.ID, O.Name
FROM DB1..DNIS numbers N
JOIN DB2..DNIS owners O ON O.ID = N.OwnerId
In order to accomplish that you will have to specify the table and column names in your join statement, like so:
SELECT db1.tablename.column, db2.tablename.column
FROM db1.tablename INNER JOIN db2.tablename
ON db1.tablename.id = db2.tablename.id;

Join / Aggregate Function Query

I have the following code and output:
SELECT CustomerCategoryName, COUNT(a.CustomerID) AS CustomersInThisCategory
FROM Sales.Customers AS a
RIGHT JOIN Sales.CustomerCategories AS b on a.CustomerCategoryID = b.CustomerCategoryID
GROUP BY CustomerCategoryName
ORDER BY CustomersInThisCategory DESC
This generates the following output:
When I add the following COUNT aggregate funcation and Inner Join:
SELECT CustomerCategoryName, COUNT(a.CustomerID) AS CustomersInThisCategory, COUNT(c.OrderID) AS Orders
FROM Sales.Customers AS a
RIGHT JOIN Sales.CustomerCategories AS b on a.CustomerCategoryID = b.CustomerCategoryID
INNER JOIN Sales.Orders AS c ON a.CustomerID = c.CustomerID
GROUP BY CustomerCategoryName
ORDER BY CustomersInThisCategory DESC
The output changes to:
I am not sure as to why the CustomersInThisCategory column is changing to the same as the Orders column? I'm also not sure why the results in the first ouput with 0 values are being removed in the second query as I still have the Right join present.
Any feedback would be much appreciated.
For your first query, count ( distinct a.customerId) should give you the unique customer ids in a category.
Regarding your second question, right join is performed before inner join. So the inner join will splice the records, for which a match is not found.
Hope my answer helps.

Left outer join with 3 tables and subquery

sorry for the late response.
For a key in table A, there may be 2 or more records present in tables B and C. That is, one another column in these tables will have a date value which would be making the keys unique. So I want to extract the record that has maximum date value. And that's why I am using the max function. I know that the subquery which I have coded should not be included in the ON clause and it would do the filtering before the join statement. So eventually I want to know how to mention the max clause in the query.
Example:
Table A
Key - AAAAA
Table B:
Record 1
Key - AAAAA
Date - 2017-10-01
Record 2
Key - AAAAA
Date - 2017-10-05
I want the only the record AAAAA/2017-10-05 to be selected from the table B
Basically records from table A where A.c3 = 'Y' should be extracted first (assume it gives 500 records)
Then join these 500 records with tables B and C (left outer, to have all the matching records and the non-matching records should have nulls in the columns from the tables B and C)
In tables B and C, if more than 1 record present with different dates, the maximum date field should be extracted.
Hence final output should contain 500 records.
This is all you need for what you describe
SELECT A.A1, A.A2, B.B1, B.B2, C.C1, C.C2
FROM TABLE1 A
LEFT OUTER JOIN TABLE2 B
ON A.A1 = B.B1
LEFT OUTER JOIN TABLE3 C
ON A.A1 = C.C1
WHERE A.C3 = ‘Y’
These lines are causing your problem...basically forcing your outer joins to an inner joins.
AND B.C3 = (SELECT MAX(B3) FROM TABLE2 T1
WHERE T1.B1 = B.B1)
AND C.C3 = (SELECT MAX(C3) FROM TABLE3 T1
WHERE T1.C1 = C.C1)
If there's no match in B or C , then B.C3 and/or C.C3 will be NULL and NULL can't be = to anything (or <> to anything for that matter)
What are you trying to accomplish with the above that you've not included in the question?
Just do it?
SELECT A.A1, A.A2, B.B1, B.B2, C.C1, C.C2
FROM TABLE1 A
LEFT OUTER JOIN TABLE2 B
ON A.A1 = B.B1
LEFT OUTER JOIN TABLE3 C
ON A.A1 = C.C1
WHERE A.C3 = 'Y' and (B.B1 is null or C.B1 is null)

Redshift - Efficient JOIN clause with OR

I have the need to join a huge table (10 million plus rows) to a lookup table (15k plus rows) with an OR condition. Something like:
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2 ON t1.c = t2.c OR t1.d = t2.d;
This is because table1 can have c or d as NULL, and I'd like to join on whichever is available, leaving out the rest. The query plan says there is a Nested Loop, which I realize is because of the OR condition. Is there a clean, efficient way of solving this problem? I'm using Redshift.
EDIT: I am trying to run this with a UNION, but it doesn't seem to be any faster than before.
If you have a preferred column you can NVL() (aka COALESCE()) them and join on that.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2
ON t1.c = NVL(t2.c,t2.d);
I'd also suggest that you should set the lookup table to DISTSTYLE ALL to ensure that the larger table is not redistributed.
[ Also, 10 million rows isn't big for Redshift. Not trying to be snotty just saying that we get excellent performance on Redshift even when querying (and joining) tables with hundreds of billions of rows. ]
How about doing two (left) joins? With the small lookup table performance shouldn't be too bad even.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t3.d)
FROM table1 t1
LEFT JOIN table2 t2 ON t1.d = t2.d and t1.c is null
LEFT JOIN table2 t3 ON t1.c = t3.c and t1.d is null
Your original query only returns rows that match at least one of c or d in the lookup table. If that's not guaranteed you may need to add filters...for example rows in t1 where both c and d are null or have values not present in table2.
Don't really need the null checks in the joins, but might be slightly faster.

Resources