Join two datasets based on a flag and id - join

I am trying to join two datasets based on a flag and id.
i.e
proc sql;
create table demo as
select a.*,b.b1,b.2
from table1 a
left join table2 on
(a.flag=b.flag and a.id=b.id) or (a.flag ne b.flag and a.id=b.id)
end;
This code runs into a loop and never produces a output.
I want to make sure that where there are flag values matching get the attributes; if not get the attributes at id level so that we do not have blank values.

This join condition cannot be optimized. It is not a good practice to use or in a join. If you check your log, you'll see this:
NOTE: The execution of this query involves performing one or more Cartesian product joins
that can not be optimized.
Instead, transform your query to do a union:
proc sql;
create table demo as
select a.*,
b.b1,
b.b2
from table1 as a
left join
table2 as b
on a.flag=b.flag and a.id=b.id
UNION
select a.*,
b.b1,
b.b2
from table1 as a
left join
table2 as b
on a.flag ne b.flag and a.id=b.id
;
quit;

Related

Hive-How to join tables with OR clause in ON statement

I've got the following problem. In my oracle db I have query as follows:
select * from table1 t1
inner join table2 t2 on
(t1.id_1= t2.id_1 or t1.id_2 = t2.id_2)
and it works perfectly.
Nowadays I need to re-write query on hive. I've seen that OR clause doesn't work in JOINS in hive (error warning : 'OR not supported in JOIN').
Is there any workaround for this except splitting query between two separate and union them?
Another way is to union two joins, e.g.,
select * from table1 t1
inner join table2 t2 on
(t1.id_1= t2.id_1)
union all
select * from table1 t1
inner join table2 t2 on
(t1.id_2 = t2.id_2)
Hive does not support non-equi joins. Common approach is to move join ON condition to the WHERE clause. In the worst case it will be the CROSS JOIN + WHERE filter, like this:
select *
from table1 t1
cross join table2 t2
where (t1.id_1= t2.id_1 or t1.id_2 = t2.id_2)
It may work slow because of rows multiplication by CROSS JOIN.
You can try to do two LEFT joins instead of CROSS and filter out cases when both conditions are false (like INNER JOIN in your query). This may perform faster than cross join because will not multiply all the rows. Also columns selected from second table can be calculated using NVL() or coalesce().
select t1.*,
nvl(t2.col1, t3.col1) as t2_col1, --take from t2, if NULL, take from t3
... calculate all other columns from second table in the same way
from table1 t1
left join table2 t2 on t1.id_1= t2.id_1
left join table2 t3 on t1.id_2 = t3.id_2
where (t1.id_1= t2.id_1 OR t1.id_2 = t3.id_2) --Only joined records allowed likke in your INNER join
As you asked, no UNION is necessary.

Why does Hive warn that this subquery would cause a Cartesian product?

According to Hive's documentation it supports NOT IN subqueries in a WHERE clause, provided that the subquery is an uncorrelated subquery (does not reference columns from the main query).
However, when I attempt to run the trivial query below, I get an error FAILED: SemanticException Cartesian products are disabled for safety reasons.
-- sample data
CREATE TEMPORARY TABLE foods (name STRING);
CREATE TEMPORARY TABLE vegetables (name STRING);
INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
-- the problematic query
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables)
Note that if I use an IN clause instead of a NOT IN clause, it actually works fine, which is perplexing because the query evaluation structure should be the same in either case.
Is there a workaround for this, or another way to filter values from a query based on their presence in another table?
This is Hive 2.3.4 btw, running on an Amazon EMR cluster.
Not sure why you would get that error. One work around is to use not exists.
SELECT f.*
FROM foods f
WHERE NOT EXISTS (SELECT 1
FROM vegetables v
WHERE v.name = f.name)
or a left join
SELECT f.*
FROM foods f
LEFT JOIN vegetables v ON v.name = f.name
WHERE v.name is NULL
You got cartesian join because this is what Hive does in this case. vegetables table is very small (just one row) and it is being broadcasted to perform the cross (most probably map-join, check the plan) join. Hive does cross (map) join first and then applies filter. Explicit left join syntax with filter as #VamsiPrabhala said will force to perform left join, but in this case it works the same, because the table is very small and CROSS JOIN does not multiply rows.
Execute EXPLAIN on your query and you will see what is exactly happening.

Join tables in Hive using LIKE

I am joining tbl_A to tbl_B, on column CustomerID in tbl_A to column Output in tbl_B which contains customer ID. However, tbl_B has all other information in related rows that I do not want to lose when joining. I tried to join using like, but I lost rows that did not contain customer ID in the output column.
Here is my join query in Hive:
select a.*, b.Output from tbl_A a
left join tbl_B b
On b.Output like concat('%', a.CustomerID, '%')
However, I lose other rows from output.
You could also achieve the objective by a simple hive query like this :)
select a.*, b.Output
from tbl_A a, tbl_B b
where b.Output like concat('%', a.CustomerID, '%')
I would suggest first extract all ID's from free floating field which in your case is 'Output' column in table B into a separate table. Then join this table with ID's to Table B again to populate in each row the ID and then this second joined table which is table B with ID's to table A.
Hope this helps.

Hive join query to list columns from only one table

I am writing a hive query to join two tables; table1 and table2. In the result I just need all columns from table1 and no columns from table2.
I know the solution where I can select all the columns manually by specifying table1.column1, table1.column2.. and so on in the select statement. But I have about 22 columns in table 1. Also, I have to do the same for multiple other tables ans its painful process.
I tried using "SELECT table1.*", but I get a parse exception.
Is there a better way to do it?
Hive 0.13 onwards the following query syntax works:
SELECT a.* FROM a JOIN b ON (a.id = b.id)
This query will select all columns from a. So instead of typing all the column names (making the query cumbersome), it is a better idea to use tablealias.*

Hive: Not in subquery join

I'm looking for a way to select all values from one table which do no exits in other table. This needs to be done on two variables, not one.
select * from tb1
where tb1.id1 not in (select id1 from tb2)
and tb1.id2 not in (select id2 from tb2)
I cannot use subquery. It needs to be done using joins only.
I tried this:
select * from tb1 full join tb2 on
tb1.id1=tb2.id1 and tb1.id2=tb2.id2
This works fine with one variable in condition, but not two.
Please suggest some resolution.
Since you are looking to get all the data from tb1 with no common data on columns id1 and id2 on tb2, You can use a left outer join on table tb1. Something like
SELECT tb1.* FROM tb1 LEFT OUTER JOIN tb2 ON
(tb1.id1=tb2.id1 AND tb1.id2=tb2.id2)
WHERE tb2.id1 IS NULL

Resources