Left outer join with Where Clause - join

I am experienced with Access and about 12 months into SQL Server SSMS.
I am not getting results I expect with a left outer join, and I don't know why. Maybe I don't understand something.
I have table 1 (the left side) with 600k products.
I have table 2 with 150,000 products (a subset of table 1).
When I do this:
SELECT [Product_Code], [Product_Desc], Store
FROM [Product Range]
I get 600,000 records
When I do a left join like this
SELECT [Product_Code], [Product_Desc], r.store, soh.SOH
FROM [Product Range] as r
LEFT JOIN [dbo].SOH as soh on r.[Product_Code] = soh.PRODUCT_Code
AND r.store = soh.store
WHERE soh.CalYearWeek=1512
I get 500k records. But I am confused. I thought a left join was supposed to return me all records from my left table regardless of anything else.
I then tried this (and I don't know why I would need to add the Null condition anyway)
SELECT [Product_Code],[Product_Desc],r.store,soh.SOH
FROM [Product Range] as r
LEFT OUTER JOIN [dbo].SOH as soh on r.[Product_Code] = soh.PRODUCT_Code
AND r.store = soh.store
WHERE soh.CalYearWeek=1512 or soh.CalYearWeek is null
and I got 550,000 records - still not the full 600k.
I am completely confused and don't know what is wrong. Can anyone help me please :-)
Matt

The problem is that WHERE conditions are applied after the join is made, so soh.CalYearWeek=1512 will only be true for successful joins. Missed joins have all nulls in the right-table columns, and the WHERE clause filters them out.
The solution is simple: Move the condition into the join:
SELECT [Product_Code], [Product_Desc], r.store, soh.SOH
FROM [Product Range] as r
LEFT JOIN [dbo].SOH as soh on r.[Product_Code] = soh.PRODUCT_Code
AND r.store = soh.store
AND soh.CalYearWeek=1512
Conditions on the join are executed as the join is being made, so you'll still get your left join, but only to rows in the right table that have that special condition.
Putting non-null conditions on the right table in the WHERE clause effectively turns a LEFT join into an INNER join, since the right table can only have a non-null value if the join was successful.
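The difference between the two placements can be reproduced on a tiny data set. This is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server; the table and column names (product_range, soh, cal_year_week) and the three sample products are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_range (product_code TEXT, store INTEGER);
    CREATE TABLE soh (product_code TEXT, store INTEGER,
                      cal_year_week INTEGER, soh INTEGER);
    INSERT INTO product_range VALUES ('P1', 1), ('P2', 1), ('P3', 1);
    -- P1 has stock in week 1512, P2 only in week 1511, P3 has no SOH rows
    INSERT INTO soh VALUES ('P1', 1, 1512, 5), ('P2', 1, 1511, 7);
""")

# Shared FROM/JOIN part; each variant appends its own filter.
JOIN = """
    FROM product_range AS r
    LEFT JOIN soh ON r.product_code = soh.product_code
                 AND r.store = soh.store
"""

def count(sql):
    return conn.execute("SELECT COUNT(*) " + sql).fetchone()[0]

# Filter in WHERE: only the P1 row survives.
where_filter = count(JOIN + " WHERE soh.cal_year_week = 1512")
# Adding "OR ... IS NULL" brings back P3 (no SOH rows), but not P2,
# which joined successfully to a week-1511 row.
where_or_null = count(
    JOIN + " WHERE soh.cal_year_week = 1512 OR soh.cal_year_week IS NULL")
# Filter in ON: every left row is kept.
on_filter = count(JOIN + " AND soh.cal_year_week = 1512")

print(where_filter, where_or_null, on_filter)  # 1 2 3
```

P2 is the interesting case: it has a matching row for another week, so it is neither a week-1512 match nor an all-NULL miss, which would explain why the "or ... is null" variant still returns fewer rows than the left table.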

You're correct in that a basic left join with no WHERE clauses will return a row for all records in the LEFT table with either data for the RIGHT table when it exists, or NULL where it doesn't.
And that is what you're getting, but then you're adding a WHERE clause which will filter out certain rows. So if you just had :
SELECT [Product_Code] ,[Product_Desc] ,r.store ,soh.SOH
FROM [Product Range] as r left join [dbo].SOH as soh
on r.[Product_Code] = soh.PRODUCT_Code
and r.store = soh.store
Then you would be seeing 600k records returned.
But then you're removing the 100k records where soh.CalYearWeek is not 1512 with the line :
WHERE soh.CalYearWeek=1512
By adding the :
or soh.CalYearWeek is null
You are adding back 50k more records where that is true. So basically, the WHERE clause acts upon the whole set of records at that time (after the join has taken place) and filters out rows which don't match. The mention of RIGHTTABLE.COLUMN in a WHERE clause is really just because by then, the column in the full row is described by that full identifier rather than just its column name alone.

In fact the problem is not in the WHERE clause. The problem, if you can call it a problem, is in the JOIN itself and how it behaves. In fact you can get exactly 600K rows, no rows at all, fewer than 600K rows or even more than 600K rows. It depends on the data in those tables.
You should understand difference between putting predicates in JOIN condition and WHERE clause. There is a big difference. Also you should understand how predicates work with NULLs.
If you have a row with code 'A' in left table, and no row with code 'A' in right table you will get one row from left table and NULLs from right table. If in right table you have one row with code 'A' you will get 1 row from left and one row from right. If you have N rows with code 'A' in left table and M rows with code 'A' in right one, you will get M*N rows in result.
To summarize here is formula for calculating number of rows in result set when using LEFT JOIN:
COUNT = (count of rows from the left table that have no corresponding rows in the right table) + SUM(COUNT_left(code[i]) * COUNT_right(code[i])), i.e. for each distinct code that appears in both tables, the number of left rows with that code times the number of right rows with that code, summed over those codes.
You get at least 600K rows after left join. In year column you can get NULLs in two ways: 1. there was no corresponding row for code in right table, 2. there was corresponding row from right table but column year is NULL itself.
When you are further filtering resultset with soh.CalYearWeek=1512, rows with NULLs and different values are eliminated from result.
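The row-count formula above can be checked against an actual LEFT JOIN. The following is a small sketch with Python's sqlite3 module; the tables l and r and the sample codes are assumptions chosen only to exercise the duplicate and unmatched cases.

```python
import sqlite3
from collections import Counter

# Hypothetical sample data: 'A' is duplicated on both sides,
# 'C' exists only on the left.
left_codes = ["A", "A", "B", "C"]
right_codes = ["A", "A", "A", "B"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE l (code TEXT)")
conn.execute("CREATE TABLE r (code TEXT)")
conn.executemany("INSERT INTO l VALUES (?)", [(c,) for c in left_codes])
conn.executemany("INSERT INTO r VALUES (?)", [(c,) for c in right_codes])

# Row count produced by the database for a plain LEFT JOIN.
actual = conn.execute(
    "SELECT COUNT(*) FROM l LEFT JOIN r ON r.code = l.code"
).fetchone()[0]

# Row count predicted by the formula.
n_left, n_right = Counter(left_codes), Counter(right_codes)
unmatched = sum(n for code, n in n_left.items() if code not in n_right)
matched = sum(n * n_right[code] for code, n in n_left.items()
              if code in n_right)
predicted = unmatched + matched

print(actual, predicted)  # 8 8
```

Here 'A' contributes 2 * 3 = 6 rows, 'B' contributes 1 * 1 = 1 row, and the unmatched 'C' contributes 1 row padded with NULLs.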
Consider example:
CREATE TABLE #t1 (Code INT)
CREATE TABLE #t2 (Code INT, Year INT)
INSERT INTO #t1 VALUES
(1), (2), (3)
SELECT * FROM #t1 t1
JOIN #t2 t2 ON t2.Code = t1.Code
WHERE t2.Year = 1512
And now different results depending on the data in the second table (each INSERT below is an alternative, run against an empty #t2):
--count 1
INSERT INTO #t2 VALUES
(1, 1512)
--count 0
INSERT INTO #t2 VALUES
(1, NULL)
--count 3
INSERT INTO #t2 VALUES
(1, 1512), (1, 1512), (1, 1512)
--count 6
INSERT INTO #t2 VALUES
(1, 1512), (2, 1512), (2, 1512), (3, 1512), (3, 1512), (3, 1512)
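The four scenarios can be replayed in a self-contained way. This sketch uses Python's sqlite3 module instead of SQL Server temp tables, and it writes LEFT JOIN (the join type under discussion) where the snippet above writes JOIN; with the WHERE filter on t2.year in place, both produce the same counts. The helper name count_rows is made up for this example.

```python
import sqlite3

def count_rows(t2_rows):
    """Build fresh t1/t2 tables and count the filtered join result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (code INTEGER)")
    conn.execute("CREATE TABLE t2 (code INTEGER, year INTEGER)")
    conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany("INSERT INTO t2 VALUES (?, ?)", t2_rows)
    n = conn.execute("""
        SELECT COUNT(*)
        FROM t1
        LEFT JOIN t2 ON t2.code = t1.code
        WHERE t2.year = 1512
    """).fetchone()[0]
    conn.close()
    return n

counts = [
    count_rows([(1, 1512)]),        # one match
    count_rows([(1, None)]),        # match exists, but year is NULL
    count_rows([(1, 1512)] * 3),    # one left row matched three times
    count_rows([(1, 1512), (2, 1512), (2, 1512),
                (3, 1512), (3, 1512), (3, 1512)]),
]
print(counts)  # [1, 0, 3, 6]
```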

Related

Left outer join with 3 tables and subquery

sorry for the late response.
For a key in table A, there may be 2 or more records present in tables B and C. That is, another column in these tables holds a date value that makes the rows unique. I want to extract the record that has the maximum date value, and that's why I am using the MAX function. I know that the subquery which I have coded should not be included in the ON clause and it would do the filtering before the join statement. So what I want to know is how to express the MAX condition in the query.
Example:
Table A
Key - AAAAA
Table B:
Record 1
Key - AAAAA
Date - 2017-10-01
Record 2
Key - AAAAA
Date - 2017-10-05
I want only the record AAAAA/2017-10-05 to be selected from table B.
Basically records from table A where A.c3 = 'Y' should be extracted first (assume it gives 500 records)
Then join these 500 records with tables B and C (left outer, to have all the matching records and the non-matching records should have nulls in the columns from the tables B and C)
In tables B and C, if more than 1 record present with different dates, the maximum date field should be extracted.
Hence final output should contain 500 records.
This is all you need for what you describe
SELECT A.A1, A.A2, B.B1, B.B2, C.C1, C.C2
FROM TABLE1 A
LEFT OUTER JOIN TABLE2 B
ON A.A1 = B.B1
LEFT OUTER JOIN TABLE3 C
ON A.A1 = C.C1
WHERE A.C3 = 'Y'
These lines are causing your problem...basically turning your outer joins into inner joins.
AND B.C3 = (SELECT MAX(B3) FROM TABLE2 T1
WHERE T1.B1 = B.B1)
AND C.C3 = (SELECT MAX(C3) FROM TABLE3 T1
WHERE T1.C1 = C.C1)
If there's no match in B or C , then B.C3 and/or C.C3 will be NULL and NULL can't be = to anything (or <> to anything for that matter)
What are you trying to accomplish with the above that you've not included in the question?
Just do it?
SELECT A.A1, A.A2, B.B1, B.B2, C.C1, C.C2
FROM TABLE1 A
LEFT OUTER JOIN TABLE2 B
ON A.A1 = B.B1
LEFT OUTER JOIN TABLE3 C
ON A.A1 = C.C1
WHERE A.C3 = 'Y' and (B.B1 is null or C.C1 is null)
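One way to keep every qualifying row from table A while picking only the latest row from B and C is to reduce B and C to their latest rows inside derived tables and left join to those. The sketch below runs on Python's sqlite3 module, not the asker's database; the date columns b_date and c_date and the sample rows are assumptions, since the thread does not name the date columns consistently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (a1 TEXT, c3 TEXT);
    CREATE TABLE table2 (b1 TEXT, b2 TEXT, b_date TEXT);
    CREATE TABLE table3 (c1 TEXT, c2 TEXT, c_date TEXT);
    INSERT INTO table1 VALUES ('AAAAA', 'Y'), ('BBBBB', 'Y'), ('CCCCC', 'N');
    INSERT INTO table2 VALUES ('AAAAA', 'old', '2017-10-01'),
                              ('AAAAA', 'new', '2017-10-05');
    INSERT INTO table3 VALUES ('AAAAA', 'only', '2017-09-30');
""")

# Each derived table keeps only the row(s) carrying the maximum date per
# key, so the filtering happens before the outer join, not in WHERE.
rows = conn.execute("""
    SELECT a.a1, b.b2, b.b_date, c.c2, c.c_date
    FROM table1 AS a
    LEFT JOIN (
        SELECT t.b1, t.b2, t.b_date
        FROM table2 AS t
        WHERE t.b_date = (SELECT MAX(x.b_date) FROM table2 AS x
                          WHERE x.b1 = t.b1)
    ) AS b ON b.b1 = a.a1
    LEFT JOIN (
        SELECT t.c1, t.c2, t.c_date
        FROM table3 AS t
        WHERE t.c_date = (SELECT MAX(x.c_date) FROM table3 AS x
                          WHERE x.c1 = t.c1)
    ) AS c ON c.c1 = a.a1
    WHERE a.c3 = 'Y'
    ORDER BY a.a1
""").fetchall()

for row in rows:
    print(row)
```

With this data, AAAAA comes back with its 2017-10-05 row from table2, and BBBBB comes back with NULLs for both joined tables. If two rows share the same maximum date for a key, both would be returned, so a tie-breaker would be needed in that case.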

Redshift - Efficient JOIN clause with OR

I have the need to join a huge table (10 million plus rows) to a lookup table (15k plus rows) with an OR condition. Something like:
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2 ON t1.c = t2.c OR t1.d = t2.d;
This is because table1 can have c or d as NULL, and I'd like to join on whichever is available, leaving out the rest. The query plan says there is a Nested Loop, which I realize is because of the OR condition. Is there a clean, efficient way of solving this problem? I'm using Redshift.
EDIT: I am trying to run this with a UNION, but it doesn't seem to be any faster than before.
If you have a preferred column you can NVL() (aka COALESCE()) them and join on that.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2
ON t1.c = NVL(t2.c,t2.d);
I'd also suggest that you should set the lookup table to DISTSTYLE ALL to ensure that the larger table is not redistributed.
[ Also, 10 million rows isn't big for Redshift. Not trying to be snotty just saying that we get excellent performance on Redshift even when querying (and joining) tables with hundreds of billions of rows. ]
How about doing two (left) joins? With the small lookup table performance shouldn't be too bad even.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t3.d)
FROM table1 t1
LEFT JOIN table2 t2 ON t1.d = t2.d and t1.c is null
LEFT JOIN table2 t3 ON t1.c = t3.c and t1.d is null
Your original query only returns rows that match at least one of c or d in the lookup table. If that's not guaranteed you may need to add filters...for example rows in t1 where both c and d are null or have values not present in table2.
Don't really need the null checks in the joins, but might be slightly faster.
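Whether the two-left-join rewrite returns the same rows as the OR join can be checked on a small sample. This sketch uses Python's sqlite3 module, so it says nothing about Redshift performance; COALESCE stands in for nvl, the sample rows are made up, and the extra WHERE filter is the kind of filter mentioned above for dropping rows with no match at all. It assumes each table1 row has exactly one of c and d set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (a INTEGER, b TEXT, c INTEGER, d INTEGER);
    CREATE TABLE table2 (c INTEGER, d INTEGER);
    -- row 3 has a c value that is not present in the lookup table
    INSERT INTO table1 VALUES (1, 'x', 10, NULL),
                              (2, 'y', NULL, 200),
                              (3, 'z', 99, NULL);
    INSERT INTO table2 VALUES (10, 100), (20, 200);
""")

# Original form: a single join with an OR condition.
or_join = conn.execute("""
    SELECT t1.a, t1.b, COALESCE(t1.c, t2.c), COALESCE(t1.d, t2.d)
    FROM table1 AS t1
    JOIN table2 AS t2 ON t1.c = t2.c OR t1.d = t2.d
    ORDER BY t1.a
""").fetchall()

# Rewrite: one equality join per column, then drop unmatched rows.
two_left_joins = conn.execute("""
    SELECT t1.a, t1.b, COALESCE(t1.c, t2.c), COALESCE(t1.d, t3.d)
    FROM table1 AS t1
    LEFT JOIN table2 AS t2 ON t1.d = t2.d AND t1.c IS NULL
    LEFT JOIN table2 AS t3 ON t1.c = t3.c AND t1.d IS NULL
    WHERE t2.d IS NOT NULL OR t3.c IS NOT NULL
    ORDER BY t1.a
""").fetchall()

print(or_join)
print(two_left_joins)
```

On this sample both queries return the rows for a = 1 and a = 2 with the missing column filled in from the lookup table, and neither returns the unmatched row a = 3.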

Left join with where clause not working

I was trying to get only selected rows from table A(not all rows) and rows matching table A from table B, but it shows only matching rows from table A and table B, excluding rest of the selected rows from table A.
I used this condition,
SELECT A.CategoryName,B.discount
from A LEFT JOIN B ON A.CategoryCode = B.CategoryCode
WHERE A.itemtype='F' and B.party_code=2
I have 2 tables:
table 1: A with 3 columns
CategoryName,CategoryCode(PK),ItemType
table 2: B with 3 columns
CategoryCode(FK),Discount,PartyCode(FK)(from another table)
NOTE: working in access 2007
For non-matching rows from table B, party_code is NULL, so the condition B.party_code=2 is not true for them and the WHERE clause drops those rows. So, you need to filter the "B" records before joining. Try
SELECT A.CategoryName,B.discount
from A LEFT JOIN B ON A.CategoryCode = B.CategoryCode and B.party_code=2
WHERE A.itemtype='F'
[EDIT] That doesn't work in Access. Next try.
You can create a query to do your filter. Let's call it "B_filtered". This is just
SELECT * FROM B where party_code = 2
(You could make the "2" a parameter to make it more flexible).
Then, just use this query in your actual query.
SELECT A.CategoryName,B_filtered.discount
from A LEFT JOIN B_filtered ON A.CategoryCode = B_filtered.CategoryCode
WHERE A.itemtype='F'
[EDIT]
Just Googled - I think you can do this directly with a subquery.
SELECT A.CategoryName,B_filtered.discount
from A LEFT JOIN (SELECT * FROM B where party_code = 2) AS B_filtered ON A.CategoryCode = B_filtered.CategoryCode
WHERE A.itemtype='F'
What mlinth proposed is correct, and would work in most other SQL dialects. The query below is the same basic concept but using a null condition.
Try:
SELECT A.CategoryName,B.discount
from A LEFT JOIN B ON A.CategoryCode = B.CategoryCode
WHERE A.itemtype='F' and (B.party_code=2 OR B.party_code IS NULL)
If party_code is nullable, switch to using the PK or another non-nullable field.
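The two suggestions in this thread can be compared side by side on sample data. This is a sketch on Python's sqlite3 module rather than Access 2007, with invented rows; the column is called party_code here to match the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (CategoryName TEXT, CategoryCode INTEGER PRIMARY KEY,
                    ItemType TEXT);
    CREATE TABLE B (CategoryCode INTEGER, Discount REAL, party_code INTEGER);
    INSERT INTO A VALUES ('Fruit', 1, 'F'), ('Fish', 2, 'F'),
                         ('Flour', 3, 'F'), ('Soap', 4, 'X');
    -- Fruit has a discount for party 2, Fish only for party 9,
    -- Flour has no discount rows at all
    INSERT INTO B VALUES (1, 10.0, 2), (2, 5.0, 9);
""")

# Filter B first (derived table), then left join.
filtered_first = conn.execute("""
    SELECT A.CategoryName, B_filtered.Discount
    FROM A LEFT JOIN (SELECT * FROM B WHERE party_code = 2) AS B_filtered
        ON A.CategoryCode = B_filtered.CategoryCode
    WHERE A.ItemType = 'F'
    ORDER BY A.CategoryCode
""").fetchall()

# Join first, then accept party 2 or NULL in the WHERE clause.
or_is_null = conn.execute("""
    SELECT A.CategoryName, B.Discount
    FROM A LEFT JOIN B ON A.CategoryCode = B.CategoryCode
    WHERE A.ItemType = 'F' AND (B.party_code = 2 OR B.party_code IS NULL)
    ORDER BY A.CategoryCode
""").fetchall()

print(filtered_first)  # [('Fruit', 10.0), ('Fish', None), ('Flour', None)]
print(or_is_null)      # [('Fruit', 10.0), ('Flour', None)]
```

The variants differ for a category like Fish, whose only discount rows belong to another party: filtering B first keeps it with a NULL discount, while the OR ... IS NULL form drops it. Which behaviour is wanted depends on the report.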

Performance Issues with sqlite joins

I've been working on building a database structure, when I ran into a problem. I have 3 tables A,B,C. Tables B and C have a 1-1 relationship while Tables A and B have a many-1 relationship. I attempted to run the following query.
SELECT A.id
FROM A
INNER JOIN B ON A.B_id = B.id
INNER JOIN C ON B.id = C.B_id
LIMIT 0,40
The query never completed, and the program ran for several seconds before not responding. Seeing as this query will need to return thousands of records, I was rather distraught that it didn't work limited to only 40 records. I then remembered that indices existed, so I created an index on all of the join criteria. I created one for A.B_id,B.id, and C.B_id. The result was a query that worked. It worked after I removed the limit clause as well, so I proceeded to the next query.
SELECT A.id
FROM A
INNER JOIN B ON A.B_id = B.id
LEFT OUTER JOIN C ON B.id = C.B_id
LIMIT 0,40
Note that the only difference is that the second join is now a left outer join. I thought that since the keys are all the same, the index should still speed this one up. I was incorrect, as the query above completed, but was rather slow. I removed the limit clause, and the query didn't complete. I removed the indices that I added previously and tried the limited statement again. It ran in the same time.
The problem ended up being that without indices, the INNER JOIN did not work at all and the LEFT OUTER JOIN worked only when limited, though it was also slow. With the indices, the INNER JOIN successfully completed as quickly as I need it to be, but the LEFT OUTER JOIN continued to work about the same.
Table A has 200,000 records
Table B has 50,000 records
Table C has 10,000 records
The LEFT OUTER JOIN query above should produce 40,000 records
Does anyone know a way to speed this up? Am I missing an index or something that would increase the performance of the LEFT OUTER JOIN?
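For anyone wanting to reproduce the setup at a small scale, the sketch below builds the three tables with indexes on the join columns and prints the plan SQLite chooses for the LEFT OUTER JOIN query. It uses Python's sqlite3 module; the index names a_bid and c_bid and the row counts are assumptions, and the plan text depends on the SQLite version, so no particular output is claimed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE B (id INTEGER PRIMARY KEY);
    CREATE TABLE A (id INTEGER PRIMARY KEY, B_id INTEGER);
    CREATE TABLE C (id INTEGER PRIMARY KEY, B_id INTEGER);
    -- B.id is already the rowid, so only the foreign-key columns
    -- get explicit indexes here
    CREATE INDEX a_bid ON A (B_id);
    CREATE INDEX c_bid ON C (B_id);
""")
# Scaled-down data: 50 B rows, 200 A rows (many-to-one), 10 C rows.
conn.executemany("INSERT INTO B VALUES (?)", [(i,) for i in range(1, 51)])
conn.executemany("INSERT INTO A (B_id) VALUES (?)",
                 [(i % 50 + 1,) for i in range(200)])
conn.executemany("INSERT INTO C (B_id) VALUES (?)",
                 [(i,) for i in range(1, 11)])

query = """
    SELECT A.id
    FROM A
    INNER JOIN B ON A.B_id = B.id
    LEFT OUTER JOIN C ON B.id = C.B_id
    LIMIT 0, 40
"""

# Each plan row ends with a human-readable description of one step.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    print(step[-1])

rows = conn.execute(query).fetchall()
print(len(rows))  # 40
```

The thing to look for in the printed plan is whether table C is reached with a SEARCH that uses the index on C.B_id or with a full SCAN for every outer row.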
As requested, here are the outputs for explain query plan:
For the two INNER JOINS:
"0" "0" "2" "SCAN TABLE C USING COVERING INDEX c.bid (~1000000 rows)"
"0" "1" "1" "SEARCH TABLE B USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)"
"0" "2" "0" "SEARCH TABLE A USING COVERING INDEX a.bid (B_id=?) (~10 rows)"
For the INNER JOIN and OUTER JOIN:
"0" "0" "0" "SCAN TABLE A USING COVERING INDEX a.bid (~1000000 rows)"
"0" "1" "1" "SEARCH TABLE B USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)"
"0" "2" "2" "SCAN TABLE C USING COVERING INDEX c.bid (~100000 rows)"

select multiple columns from different tables and join in hive

I have a hive table A with 5 columns, the first column(A.key) is the key and I want to keep all 5 columns. I want to select 2 columns from B, say B.key1 and B.key2 and 2 columns from C, say C.key1 and C.key2. I want to join these columns with A.key = B.key1 and B.key2 = C.key1
What I want is a new external table D that has the following columns. B.key2 and C.key2 values should be given NULL if no matching happened.
A.key, A_col1, A_col2, A_col3, A_col4, B.key2, C.key2
What should be the correct hive query command? I got a max split error for my initial try.
Does this work?
create external table D as
select A.key, A.col1, A.col2, A.col3, A.col4, B.key2 as b_key2, C.key2 as c_key2
from A left outer join B on A.key = B.key1 left outer join C on B.key2 = C.key1;
If not, could you post more info about the "max split error" you mentioned? Copy+paste specific error message text is good.
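The shape of the chained join (A to B on A.key = B.key1, then B to C on B.key2 = C.key1) can be tried out on a few rows before running it on Hive. This sketch uses Python's sqlite3 module, so it only illustrates the join logic, not Hive's CREATE TABLE behaviour or the split error; the sample rows and the aliases b_key2 and c_key2 are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (key INTEGER, col1 TEXT);
    CREATE TABLE B (key1 INTEGER, key2 INTEGER);
    CREATE TABLE C (key1 INTEGER, key2 TEXT);
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    -- key 3 has no row in B; B.key2 = 20 has no row in C
    INSERT INTO B VALUES (1, 10), (2, 20);
    INSERT INTO C VALUES (10, 'c10');
""")

rows = conn.execute("""
    SELECT A.key, A.col1, B.key2 AS b_key2, C.key2 AS c_key2
    FROM A
    LEFT OUTER JOIN B ON A.key = B.key1
    LEFT OUTER JOIN C ON B.key2 = C.key1
    ORDER BY A.key
""").fetchall()

for row in rows:
    print(row)
```

Every row of A is kept. When B has no match, both b_key2 and c_key2 are NULL, because the second join condition compares a NULL B.key2 and so cannot match.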
