Hive: Not in subquery join

I'm looking for a way to select all values from one table which do not exist in another table. This needs to be done on two variables, not one.
select * from tb1
where tb1.id1 not in (select id1 from tb2)
and tb1.id2 not in (select id2 from tb2)
I cannot use a subquery. It needs to be done using joins only.
I tried this:
select * from tb1 full join tb2 on
tb1.id1=tb2.id1 and tb1.id2=tb2.id2
This works fine with one variable in the condition, but not with two.
Please suggest a solution.

Since you want all the rows from tb1 that have no matching row in tb2 on columns id1 and id2, you can use a left outer join from tb1 and keep only the rows where the tb2 side is NULL. Something like:
SELECT tb1.* FROM tb1 LEFT OUTER JOIN tb2 ON
(tb1.id1=tb2.id1 AND tb1.id2=tb2.id2)
WHERE tb2.id1 IS NULL
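
To see what the LEFT OUTER JOIN plus IS NULL pattern returns, here is a minimal sketch that runs the same SQL from Python against an in-memory SQLite database. SQLite only stands in for Hive here, and the sample rows are invented for illustration. The sketch also includes a per-column variant (one left join per column, still without subqueries), which is closer to what the two NOT IN conditions in the question express. The two filters are not the same, so it is worth checking which one you actually need. Both queries assume that neither table contains NULLs in id1 or id2.

```python
import sqlite3

# In-memory SQLite stands in for Hive; the rows below are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tb1 (id1 INTEGER, id2 INTEGER);
    CREATE TABLE tb2 (id1 INTEGER, id2 INTEGER);
    INSERT INTO tb1 VALUES (1, 10), (2, 20), (3, 30), (4, 40);
    INSERT INTO tb2 VALUES (1, 10), (2, 99), (7, 30);
""")

# Anti-join on the pair: keep tb1 rows for which no tb2 row matches BOTH columns.
pair_anti_join = conn.execute("""
    SELECT tb1.*
    FROM tb1
    LEFT OUTER JOIN tb2
      ON tb1.id1 = tb2.id1 AND tb1.id2 = tb2.id2
    WHERE tb2.id1 IS NULL
    ORDER BY tb1.id1
""").fetchall()

# Per-column variant: keep tb1 rows whose id1 appears nowhere in tb2.id1
# AND whose id2 appears nowhere in tb2.id2 (aliases x and y are illustrative).
per_column_anti_join = conn.execute("""
    SELECT tb1.*
    FROM tb1
    LEFT OUTER JOIN tb2 x ON tb1.id1 = x.id1
    LEFT OUTER JOIN tb2 y ON tb1.id2 = y.id2
    WHERE x.id1 IS NULL AND y.id2 IS NULL
    ORDER BY tb1.id1
""").fetchall()

print(pair_anti_join)        # rows with no exact (id1, id2) pair in tb2
print(per_column_anti_join)  # rows sharing neither id1 nor id2 with tb2
```

With this sample data the pair version keeps three rows, while the per-column version keeps only (4, 40), because the rows (2, 20) and (3, 30) each share one column value with some row of tb2.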

Related

Join two datasets based on a flag and id

I am trying to join two datasets based on a flag and id.
For example:
proc sql;
create table demo as
select a.*, b.b1, b.b2
from table1 a
left join table2 b on
(a.flag=b.flag and a.id=b.id) or (a.flag ne b.flag and a.id=b.id);
quit;
This code seems to run forever and never produces any output.
Where the flag values match, I want to get the attributes from that match; where they do not, I want to get the attributes at the id level so that we do not end up with blank values.

This join condition cannot be optimized. It is not good practice to use OR in a join condition. If you check your log, you'll see this:
NOTE: The execution of this query involves performing one or more Cartesian product joins
that can not be optimized.
Instead, transform your query to do a union:
proc sql;
create table demo as
select a.*,
b.b1,
b.b2
from table1 as a
left join
table2 as b
on a.flag=b.flag and a.id=b.id
UNION
select a.*,
b.b1,
b.b2
from table1 as a
left join
table2 as b
on a.flag ne b.flag and a.id=b.id
;
quit;
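
The goal stated in the question (use the flag-level match when there is one, otherwise fall back to an id-level match) can also be sketched with two left joins and COALESCE. This is a different technique from the UNION above, not a restatement of it. The sketch below runs in SQLite from Python as a stand-in for PROC SQL, with invented sample rows, and it assumes table2 has at most one row per (id, flag) pair and only two possible flag values. The aliases m and f are hypothetical names for the flag-level match and the fallback.

```python
import sqlite3

# SQLite stands in for SAS PROC SQL; `<>` plays the role of SAS `ne`.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, flag TEXT);
    CREATE TABLE table2 (id INTEGER, flag TEXT, b1 TEXT, b2 TEXT);
    INSERT INTO table1 VALUES (1, 'Y'), (2, 'N'), (3, 'Y');
    INSERT INTO table2 VALUES
        (1, 'Y', 'p1',  'q1'),
        (1, 'N', 'p1n', 'q1n'),
        (2, 'Y', 'p2',  'q2');
""")

demo = conn.execute("""
    SELECT a.id,
           a.flag,
           COALESCE(m.b1, f.b1) AS b1,   -- prefer the flag-level match
           COALESCE(m.b2, f.b2) AS b2    -- else use the id-level fallback
    FROM table1 AS a
    LEFT JOIN table2 AS m ON a.id = m.id AND a.flag =  m.flag
    LEFT JOIN table2 AS f ON a.id = f.id AND a.flag <> f.flag
    ORDER BY a.id
""").fetchall()

for row in demo:
    print(row)
```

Both join conditions are plain equalities on id, so neither one forces a Cartesian product. Note that COALESCE works column by column, so if a flag-level row exists but has a missing b1, the value would be taken from the fallback row.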

Hive-How to join tables with OR clause in ON statement

I've got the following problem. In my Oracle database I have the following query:
select * from table1 t1
inner join table2 t2 on
(t1.id_1= t2.id_1 or t1.id_2 = t2.id_2)
and it works perfectly.
Now I need to rewrite the query in Hive. I've seen that the OR clause doesn't work in joins in Hive (error: 'OR not supported in JOIN').
Is there any workaround for this other than splitting the query into two separate queries and combining them with a union?

Another way is to union two joins, e.g.,
select * from table1 t1
inner join table2 t2 on
(t1.id_1= t2.id_1)
union all
select * from table1 t1
inner join table2 t2 on
(t1.id_2 = t2.id_2)
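
One thing to watch with this rewrite: a pair of rows that satisfies both conditions comes back twice from UNION ALL, whereas the original OR join returns it once. The sketch below illustrates this on invented sample data, using SQLite from Python as a stand-in for Hive (SQLite accepts OR in the ON clause, which makes the comparison possible). The val column is a hypothetical payload column. Switching to a de-duplicating UNION fixes the count here, but it would also collapse rows that are genuinely identical in the source tables.

```python
import sqlite3

# SQLite stands in for Hive; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id_1 INTEGER, id_2 INTEGER);
    CREATE TABLE table2 (id_1 INTEGER, id_2 INTEGER, val TEXT);
    INSERT INTO table1 VALUES (1, 100), (2, 200), (3, 300);
    INSERT INTO table2 VALUES (1, 100, 'both'), (2, 999, 'first'), (9, 300, 'second');
""")

TWO_JOINS = """
    SELECT t1.id_1, t1.id_2, t2.val
    FROM table1 t1 INNER JOIN table2 t2 ON t1.id_1 = t2.id_1
    {set_op}
    SELECT t1.id_1, t1.id_2, t2.val
    FROM table1 t1 INNER JOIN table2 t2 ON t1.id_2 = t2.id_2
"""

def vals(sql):
    """Run a query and return the sorted values of its last column."""
    return sorted(row[-1] for row in conn.execute(sql))

# Reference result: the original OR join.
or_vals = vals("""
    SELECT t1.id_1, t1.id_2, t2.val
    FROM table1 t1 INNER JOIN table2 t2
      ON (t1.id_1 = t2.id_1 OR t1.id_2 = t2.id_2)
""")
union_all_vals = vals(TWO_JOINS.format(set_op="UNION ALL"))
union_vals = vals(TWO_JOINS.format(set_op="UNION"))

print(or_vals)         # each matching pair once
print(union_all_vals)  # the pair matching on both keys appears twice
print(union_vals)      # duplicates removed
```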

Hive does not support non-equi joins. The common approach is to move the join ON condition to the WHERE clause. In the worst case this becomes a CROSS JOIN plus a WHERE filter, like this:
select *
from table1 t1
cross join table2 t2
where (t1.id_1= t2.id_1 or t1.id_2 = t2.id_2)
This may be slow because the CROSS JOIN multiplies the rows.
You can try two LEFT joins instead of the CROSS JOIN and filter out the cases where both conditions are false (which mimics the INNER JOIN in your query). This may perform faster than the cross join because it does not multiply all the rows. The columns selected from the second table can then be calculated using NVL() or coalesce().
select t1.*,
nvl(t2.col1, t3.col1) as t2_col1, -- take from t2; if NULL, take from t3
-- ... calculate all other columns from the second table in the same way
from table1 t1
left join table2 t2 on t1.id_1 = t2.id_1
left join table2 t3 on t1.id_2 = t3.id_2
where (t1.id_1 = t2.id_1 OR t1.id_2 = t3.id_2) -- only joined records are allowed, as in your INNER JOIN
As you asked, no UNION is necessary.
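
As a quick sanity check of the two-LEFT-JOIN rewrite, the sketch below runs it next to the original OR join on a small invented data set. SQLite (driven from Python) stands in for Hive, and COALESCE is used in place of NVL(), which SQLite lacks. The sample data is built so that every table1 row matches at most one table2 row. If a table1 row matched one table2 row through id_1 and a different one through id_2, the OR join would return two rows where this rewrite returns one, so the two queries are not interchangeable in general.

```python
import sqlite3

# SQLite stands in for Hive; the rows and col1 values are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id_1 INTEGER, id_2 INTEGER);
    CREATE TABLE table2 (id_1 INTEGER, id_2 INTEGER, col1 TEXT);
    INSERT INTO table1 VALUES (1, 100), (2, 200), (3, 300), (4, 400);
    INSERT INTO table2 VALUES (1, 100, 'A'), (2, 999, 'B'), (9, 300, 'C');
""")

# Rewrite from the answer: two equi-joins plus a WHERE filter.
two_left_joins = conn.execute("""
    SELECT t1.id_1, t1.id_2,
           COALESCE(t2.col1, t3.col1) AS t2_col1  -- from t2, else from t3
    FROM table1 t1
    LEFT JOIN table2 t2 ON t1.id_1 = t2.id_1
    LEFT JOIN table2 t3 ON t1.id_2 = t3.id_2
    WHERE (t1.id_1 = t2.id_1 OR t1.id_2 = t3.id_2)  -- drop unmatched rows
    ORDER BY t1.id_1
""").fetchall()

# Reference: the original inner join with OR in the ON clause.
or_join = conn.execute("""
    SELECT t1.id_1, t1.id_2, t2.col1
    FROM table1 t1
    INNER JOIN table2 t2 ON (t1.id_1 = t2.id_1 OR t1.id_2 = t2.id_2)
    ORDER BY t1.id_1
""").fetchall()

print(two_left_joins)
print(or_join)
```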

Join tables in Hive using LIKE

I am joining tbl_A to tbl_B, on column CustomerID in tbl_A to column Output in tbl_B which contains customer ID. However, tbl_B has all other information in related rows that I do not want to lose when joining. I tried to join using like, but I lost rows that did not contain customer ID in the output column.
Here is my join query in Hive:
select a.*, b.Output from tbl_A a
left join tbl_B b
On b.Output like concat('%', a.CustomerID, '%')
However, I lose other rows from output.

You could also achieve this with a simple Hive query like this:
select a.*, b.Output
from tbl_A a, tbl_B b
where b.Output like concat('%', a.CustomerID, '%')

I would suggest first extracting all the IDs from the free-text field (in your case the 'Output' column in table B) into a separate table. Then join that table of IDs back to table B so that each row carries its ID, and finally join the resulting table (table B with IDs) to table A.
Hope this helps.

Redshift - Efficient JOIN clause with OR

I need to join a huge table (10 million plus rows) to a lookup table (15k plus rows) with an OR condition. Something like:
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2 ON t1.c = t2.c OR t1.d = t2.d;
This is because table1 can have c or d as NULL, and I'd like to join on whichever is available, leaving out the rest. The query plan says there is a Nested Loop, which I realize is because of the OR condition. Is there a clean, efficient way of solving this problem? I'm using Redshift.
EDIT: I am trying to run this with a UNION, but it doesn't seem to be any faster than before.

If you have a preferred column, you can wrap the two columns in NVL() (also known as COALESCE()) and join on the result.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2
ON t1.c = NVL(t2.c,t2.d);
I'd also suggest that you should set the lookup table to DISTSTYLE ALL to ensure that the larger table is not redistributed.
[ Also, 10 million rows isn't big for Redshift. Not trying to be snotty just saying that we get excellent performance on Redshift even when querying (and joining) tables with hundreds of billions of rows. ]

How about doing two (left) joins? With such a small lookup table, performance shouldn't be too bad.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t3.d)
FROM table1 t1
LEFT JOIN table2 t2 ON t1.d = t2.d and t1.c is null
LEFT JOIN table2 t3 ON t1.c = t3.c and t1.d is null
Your original query only returns rows that match at least one of c or d in the lookup table. If that's not guaranteed, you may need to add filters, for example to exclude rows in t1 where both c and d are null or have values not present in table2.
The null checks in the joins are not strictly needed, but they might make the query slightly faster.

Get incremental changes between Hive partitions

I have a nightly job that runs and computes some data in hive. It is partitioned by day.
Fields:
id bigint
rank bigint
Yesterday
output/dt=2013-10-31
Today
output/dt=2013-11-01
I am trying to figure out if there is an easy way to get the incremental changes between today and yesterday.
I was thinking about doing a left outer join, but I'm not sure what that looks like since it's the same table.
This is what it might look like when there are different tables:
SELECT * FROM a LEFT OUTER JOIN b
ON (a.id=b.id AND a.dt='2013-11-01' and b.dt='2013-10-31' ) WHERE a.rank!=b.rank
But on the same table it is
SELECT * FROM a LEFT OUTER JOIN a
ON (a.id=a.id AND a.dt='2013-11-01' and a.dt='2013-10-31' ) WHERE a.rank!=a.rank
Suggestions?

This would work:
SELECT a.*
FROM A a LEFT OUTER JOIN A b ON a.id = b.id
WHERE a.dt='2013-11-01' AND b.dt='2013-10-31' AND <your-rank-conditions>;
In terms of efficiency, this would take only one MapReduce job.

So I figured it out, using subqueries and joins:
select * from (select * from table where dt='2013-11-01') a
FULL OUTER JOIN
(select * from table where dt='2013-10-31') b
on (a.id=b.id)
where a.rank!=b.rank or a.rank is null or b.rank is null
The above will give you the diff.
You can take the diff and work out what you need to ADD/UPDATE/DELETE:
UPDATE: if a.rank is not null and b.rank is not null, i.e. the rank changed
DELETE: if a.rank is null and b.rank is not null, i.e. the user is no longer ranked
ADD: if a.rank is not null and b.rank is null, i.e. this is a new user
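
The classification can also be pushed into the query itself. Below is a minimal sketch under a few assumptions: it runs in SQLite from Python rather than in Hive, the table name ranks and the sample rows are invented, and stored rank values are never NULL. Because some SQLite builds do not support FULL OUTER JOIN, the sketch emulates it with a LEFT OUTER JOIN from today to yesterday plus a second LEFT OUTER JOIN in the other direction that only picks up ids missing today.

```python
import sqlite3

# SQLite stands in for Hive; dt plays the role of the partition column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ranks (id INTEGER, rank INTEGER, dt TEXT);
    INSERT INTO ranks VALUES
        (1, 5, '2013-11-01'), (2, 7, '2013-11-01'), (4, 1, '2013-11-01'),
        (1, 5, '2013-10-31'), (2, 9, '2013-10-31'), (3, 2, '2013-10-31');
""")

changes = conn.execute("""
    -- a = today, b = yesterday: new ids and changed ranks
    SELECT a.id AS id,
           CASE WHEN b.rank IS NULL THEN 'ADD' ELSE 'UPDATE' END AS action
    FROM (SELECT * FROM ranks WHERE dt = '2013-11-01') a
    LEFT OUTER JOIN (SELECT * FROM ranks WHERE dt = '2013-10-31') b
      ON a.id = b.id
    WHERE b.rank IS NULL OR a.rank != b.rank
    UNION ALL
    -- ids present yesterday but missing today
    SELECT b.id AS id, 'DELETE' AS action
    FROM (SELECT * FROM ranks WHERE dt = '2013-10-31') b
    LEFT OUTER JOIN (SELECT * FROM ranks WHERE dt = '2013-11-01') a
      ON a.id = b.id
    WHERE a.rank IS NULL
    ORDER BY id
""").fetchall()

for change in changes:
    print(change)
```

In this sample, id 1 has the same rank on both days, so it does not appear in the diff; id 2 changed rank, id 3 disappeared, and id 4 is new.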
