Hive-How to join tables with OR clause in ON statement

Hive-How to join tables with OR clause in ON statement - join

I've got the following problem. In my oracle db I have query as follows:
select * from table1 t1
inner join table2 t2 on
(t1.id_1= t2.id_1 or t1.id_2 = t2.id_2)
and it works perfectly.
Nowadays I need to re-write query on hive. I've seen that OR clause doesn't work in JOINS in hive (error warning : 'OR not supported in JOIN').
Is there any workaround for this except splitting query between two separate and union them?

Another way is to union two joins, e.g.,
select * from table1 t1
inner join table2 t2 on
(t1.id_1= t2.id_1)
union all
select * from table1 t1
inner join table2 t2 on
(t1.id_2 = t2.id_2)

Hive does not support non-equi joins. Common approach is to move join ON condition to the WHERE clause. In the worst case it will be the CROSS JOIN + WHERE filter, like this:
select *
from table1 t1
cross join table2 t2
where (t1.id_1= t2.id_1 or t1.id_2 = t2.id_2)
It may work slow because of rows multiplication by CROSS JOIN.
You can try to do two LEFT joins instead of CROSS and filter out cases when both conditions are false (like INNER JOIN in your query). This may perform faster than cross join because will not multiply all the rows. Also columns selected from second table can be calculated using NVL() or coalesce().
select t1.*,
nvl(t2.col1, t3.col1) as t2_col1, --take from t2, if NULL, take from t3
... calculate all other columns from second table in the same way
from table1 t1
left join table2 t2 on t1.id_1= t2.id_1
left join table2 t3 on t1.id_2 = t3.id_2
where (t1.id_1= t2.id_1 OR t1.id_2 = t3.id_2) --Only joined records allowed likke in your INNER join
As you asked, no UNION is necessary.

Related

Join two datasets based on a flag and id

I am trying to join two datasets based on a flag and id.
i.e
proc sql;
create table demo as
select a.*,b.b1,b.2
from table1 a
left join table2 on
(a.flag=b.flag and a.id=b.id) or (a.flag ne b.flag and a.id=b.id)
end;
This code runs into a loop and never produces a output.
I want to make sure that where there are flag values matching get the attributes; if not get the attributes at id level so that we do not have blank values.

This join condition cannot be optimized. It is not a good practice to use or in a join. If you check your log, you'll see this:
NOTE: The execution of this query involves performing one or more Cartesian product joins
that can not be optimized.
Instead, transform your query to do a union:
proc sql;
create table demo as
select a.*,
b.b1,
b.b2
from table1 as a
left join
table2 as b
on a.flag=b.flag and a.id=b.id
UNION
select a.*,
b.b1,
b.b2
from table1 as a
left join
table2 as b
on a.flag ne b.flag and a.id=b.id
;
quit;

Redshift - Efficient JOIN clause with OR

I have the need to join a huge table (10 million plus rows) to a lookup table (15k plus rows) with an OR condition. Something like:
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2 ON t1.c = t2.c OR t1.d = t2.d;
This is because table1 can have c or d as NULL, and I'd like to join on whichever is available, leaving out the rest. The query plan says there is a Nested Loop, which I realize is because of the OR condition. Is there a clean, efficient way of solving this problem? I'm using Redshift.
EDIT: I am trying to run this with a UNION, but it doesn't seem to be any faster than before.

If you have a preferred column you can NVL() (aka COALESCE()) them and join on that.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2
ON t1.c = NVL(t2.c,t2.d);
I'd also suggest that you should set the lookup table to DISTSTYLE ALL to ensure that the larger table is not redistributed.
[ Also, 10 million rows isn't big for Redshift. Not trying to be snotty just saying that we get excellent performance on Redshift even when querying (and joining) tables with hundreds of billions of rows. ]

How about doing two (left) joins? With the small lookup table performance shouldn't be too bad even.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t3.d)
FROM table1 t1
LEFT JOIN table2 t2 ON t1.d = t2.d and t1.c is null
LEFT JOIN table2 t3 ON t1.c = t3.c and t1.d is null
Your original query only returns rows that match at least one of c or d in the lookup table. If that's not guaranteed you may need to add filters...for example rows in t1 where both c and d are null or have values not present in table2.
Don't really need the null checks in the joins, but might be slightly faster.

Complex Join to Return SuperSet

Evening All,
I have been chipping away at this one for a while and for some reason i just can't seem to get my logic to return the way I expect it to.
I have 3 Data tables as well as 3 business concept linking tables.
Table1
Table2
Table3
Rules:
Table 1 can be linked to Table 2
Table 1 can be directly linked to table 3
Table 1 can be indirectly linked to table 3 via table 2
I have tried a fair few variations however It seems to truncate records.
SELECT
*
FROM
Table1 T1
INNER JOIN Table1_to_Table2_Link L1 on L1.T1_ID = T1.ID
RIGHT JOIN TABLE2 T2 ON L1.T2_ID = T2.ID
INNER JOIN Table2_to_Table3_Link L2 ON L2.T2_ID = T2.ID
Right JOIN Table3 T3 ON L2.T3_ID = T3.ID
INNER JOIN Table1_to_Table3_Link L3 on T1.ID = L3.T1_ID
Its a bit awkward to explain but in summart
I require All the Data from Table 1
And only the Data in Tables 2 and 3 if they are directly/indirectly related to table 1. Tables 2 and 3 don't necessarily have to have a related business concept.
The Return Expected is;
Any assistance would be kindly appreciated

You were right. It was not that simple. However I could get desired output by below query
SELECT
T1.*,
T2.*,
T3.*
FROM
Table1 T1
LEFT JOIN Table1_to_Table2_Link L1 on T1.ID = L1.T1_ID
LEFT JOIN TABLE2 T2 ON T2.ID = L1.T2_ID
LEFT JOIN (
SELECT T1_ID AS ID,T3_ID AS table3Id FROM dbo.Table1_to_Table3_Link
UNION ALL
SELECT T2_ID AS ID,T3_ID AS table3Id FROM dbo.Table2_to_Table3_Link
) S
ON T1.ID = s.ID
OR t2.ID = s.id
LEFT JOIN dbo.Table3 T3 ON S.table3Id = T3.ID
Hope it helps.

Error in Hive Query while joining tables

I am unable to pass the equality check using the below HIVE query.
I have 3 table and i want to join these table. I trying as below, but get error :
FAILED: Error in semantic analysis: Line 3:40 Both left and right aliases encountered in JOIN 'visit_date'
select t1.*, t99.* from table1 t1 JOIN
(select v3.*, t3.* from table2 v3 JOIN table3 t3 ON
( v3.AS_upc= t3.upc_no AND v3.start_dt <= t3.visit_date AND v3.end_dt >= t3.visit_date AND v3.adv_price <= t3.comp_price ) ) t99 ON
(t1.comp_store_id = t99.cpnumber AND t1.AS_store_nbr = t99.store_no);
EDITED based on help from FuzzyTree:
1st:
We tried to edit above query using between and where clause, but not getting any output from the query.
But If we changed the above query by removing the between clause with date, then I got some output based on "v3.adv_price <= t3.comp_price", but not using "date filter".
select t1.*, t99.* from table1 t1 JOIN
(select v3.*, t3.* from table2 v3 JOIN table3 t3 on (v3.AS_upc= t3.upc_no)
where v3.adv_price <= t3.comp_price
) t99 ON
(t1.comp_store_id = t99.cpnumber AND t1.AS_store_nbr = t99.store_no);
2nd :
Next we tried to pass only one date as :
select t1.*, t99.* from table1 t1 JOIN
(select v3.*, t3.* from table2 v3 JOIN table3 t3 on (v3.AS_upc= t3.upc_no)
where v3.adv_price <= t3.comp_price and v3.start_dt <= t3.visit_date
) t99 ON
(t1.comp_store_id = t99.cpnumber AND t1.AS_store_nbr = t99.store_no);
So, now it's showing some result but if we pass both the start and end date filter, it; not showing any result.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
Only equality joins, outer joins, and left semi joins are supported in
Hive. Hive does not support join conditions that are not equality
conditions as it is very difficult to express such conditions as a
map/reduce job.
Try moving your inequalities to the where clause
select t1.*, t99.* from table1 t1 JOIN
(select v3.*, t3.* from table2 v3 JOIN table3 t3 on (v3.AS_upc= t3.upc_no)
where t3.visit_date between v3.start_dt and v3.end_dt
and v3.adv_price <= t3.comp_price
) t99 ON
(t1.comp_store_id = t99.cpnumber AND t1.AS_store_nbr = t99.store_no);

query from multiple tables using inner joins takes long to execute

I have an sql query below that is taking too long to execute. kindly check the query and optimise it for me, i need to count number of files from a file_Actions table but combining it three other tables using inner join
SELECT count(*) as total
FROM (SELECT t1.cfid as cfid,MAX(t1.timestamp) d
FROM file_actions t1
INNER JOIN case_files t2 ON t2.cfid=t1.cfid
INNER JOIN case_file_allocations t3 ON t1.cfid=t3.cfid
INNER JOIN cbeta_user t4
WHERE t4.id=t1.user_id
AND t4.team_leader='$user' and t2.closed<>'yes' AND
t2.deleted<>1 AND
t3.reallocated<>'yes' GROUP BY t1.cfid) a
WHERE d < '$yesterday'
I think it is the inner joins that causes the query to take so long to execute causing the system to slow

Try including the WHERE d < '$yesterday' into the subquery a. Remove the fields and place the Count(*). If your tables aren't indexed on those values that you are using for conditions and relations, try to make an index.
SELECT count(*) as total
FROM file_actions t1
INNER JOIN case_files t2
ON t2.cfid=t1.cfid
INNER JOIN case_file_allocations t3
ON t1.cfid=t3.cfid
INNER JOIN cbeta_user t4
WHERE t4.id=t1.user_id
AND t4.team_leader='$user' and t2.closed<>'yes'
AND t2.deleted<>1
AND t3.reallocated<>'yes'
AND d < '$yesterday'
GROUP BY t1.cfid
Recommended reading: http://www.code-fly.com/5-tips-to-make-your-sql-queries-faster/

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Hive-How to join tables with OR clause in ON statement - join

Another way is to union two joins, e.g., select * from table1 t1 inner join table2 t2 on (t1.id_1= t2.id_1) union all select * from table1 t1 inner join table2 t2 on (t1.id_2 = t2.id_2)

Related

Join two datasets based on a flag and id

Redshift - Efficient JOIN clause with OR

Complex Join to Return SuperSet

Error in Hive Query while joining tables

query from multiple tables using inner joins takes long to execute

Categories

Resources