Join tables in Hive using LIKE - join

I am joining tbl_A to tbl_B, matching column CustomerID in tbl_A against column Output in tbl_B, which contains the customer ID. However, tbl_B also has related rows with other information that I do not want to lose when joining. I tried joining with LIKE, but I lost the rows whose Output column does not contain a customer ID.
Here is my join query in Hive:
select a.*, b.Output from tbl_A a
left join tbl_B b
On b.Output like concat('%', a.CustomerID, '%')
However, this drops the other rows from the result.

You could also achieve this with a simple Hive query like the one below. Note that it behaves as an inner join, so rows without a matching customer ID are not returned.
select a.*, b.Output
from tbl_A a, tbl_B b
where b.Output like concat('%', a.CustomerID, '%')

I would suggest first extracting all the IDs from the free-text field (in your case, the Output column of tbl_B) into a separate table. Then join that ID table back to tbl_B so that each row carries its ID, and finally join the result (tbl_B with IDs) to tbl_A.
Hope this helps.
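If the goal is to keep every row of tbl_B, one option is to make tbl_B the preserved (left) side of the LEFT JOIN. The sketch below is only an illustration under assumptions: it uses Python's built-in sqlite3 as a stand-in for Hive, the sample rows are made up, and SQLite's || operator replaces concat(). Whether Hive accepts LIKE in an ON clause depends on the Hive version, so treat this as a demonstration of the join semantics rather than a drop-in Hive query.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_A (CustomerID TEXT);
    CREATE TABLE tbl_B (Output TEXT);
    -- Hypothetical sample data, not from the original question.
    INSERT INTO tbl_A VALUES ('C100'), ('C200');
    INSERT INTO tbl_B VALUES
        ('order placed by C100'),
        ('system restart'),
        ('refund issued to C200');
""")

# tbl_B is on the left, so rows with no customer ID survive
# and simply get NULL for CustomerID.
rows = conn.execute("""
    SELECT b.Output, a.CustomerID
    FROM tbl_B b
    LEFT JOIN tbl_A a
      ON b.Output LIKE '%' || a.CustomerID || '%'
    ORDER BY b.rowid
""").fetchall()

for row in rows:
    print(row)
```

The 'system restart' row has no customer ID, yet it stays in the result with CustomerID set to NULL (None in Python).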

Related

Joining on column names with spaces

I'm trying to join two tables using PROC SQL. One of the columns I'm using for the join has a space in its name. The query I'm using:
PROC SQL;
CREATE TABLE TEST AS
SELECT a.*, b.*
FROM TABLE_1 a
INNER JOIN TABLE_2 b
ON a.CONTNO = b."Contract Number";
QUIT;
This is the error I'm getting:
ERROR 22-322: Syntax error, expecting one of the following: a name, *.
How do I fix this?
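Square brackets around the column name work in T-SQL and Access, but not in PROC SQL. In SAS, "Contract Number" in plain double quotes is read as a string constant, which causes the syntax error. Write the column as a name literal instead: the name in quotes, followed by the letter n. For example:
b.'Contract Number'n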
Tip: table aliases (a, b) have no performance cost; they are optional shorthand. With only two tables in the join, you can write out the full table names if you find that more readable.

Join two datasets based on a flag and id

I am trying to join two datasets based on a flag and an id.
For example:
proc sql;
create table demo as
select a.*, b.b1, b.b2
from table1 a
left join table2 b on
(a.flag=b.flag and a.id=b.id) or (a.flag ne b.flag and a.id=b.id);
quit;
This code seems to run forever and never produces any output.
I want to pick up the attributes where the flag values match; where they do not, I want the attributes at the id level, so that we do not end up with blank values.
This join condition cannot be optimized. It is not good practice to use OR in a join condition. If you check your log, you'll see this:
NOTE: The execution of this query involves performing one or more Cartesian product joins
that can not be optimized.
Instead, transform your query to do a union:
proc sql;
create table demo as
select a.*,
b.b1,
b.b2
from table1 as a
left join
table2 as b
on a.flag=b.flag and a.id=b.id
UNION
select a.*,
b.b1,
b.b2
from table1 as a
left join
table2 as b
on a.flag ne b.flag and a.id=b.id
;
quit;

Hive join query to list columns from only one table

I am writing a hive query to join two tables; table1 and table2. In the result I just need all columns from table1 and no columns from table2.
I know I can select all the columns manually by specifying table1.column1, table1.column2, and so on in the select statement. But I have about 22 columns in table1. I also have to do the same for multiple other tables, and it's a painful process.
I tried using "SELECT table1.*", but I get a parse exception.
Is there a better way to do it?
From Hive 0.13 onwards, the following query syntax works:
SELECT a.* FROM a JOIN b ON (a.id = b.id)
This query will select all columns from a. So instead of typing all the column names (making the query cumbersome), it is a better idea to use tablealias.*

Left join with where clause not working

I was trying to get selected rows from table A (not all rows) together with the rows from table B that match them. Instead, the query returns only the rows that match in both tables and excludes the rest of the selected rows from table A.
I used this condition,
SELECT A.CategoryName,B.discount
from A LEFT JOIN B ON A.CategoryCode = B.CategoryCode
WHERE A.itemtype='F' and B.party_code=2
I have 2 tables:
table 1: A with 3 columns
CategoryName,CategoryCode(PK),ItemType
table 2: B with 3 columns
CategoryCode(FK),Discount,PartyCode(FK)(from another table)
NOTE: working in Access 2007
For rows of A with no match in B, B.party_code is NULL, so the condition B.party_code=2 in your WHERE clause is not true and the row is not returned. You need to filter the "B" records before joining. Try
SELECT A.CategoryName,B.discount
from A LEFT JOIN B ON A.CategoryCode = B.CategoryCode and B.party_code=2
WHERE A.itemtype='F'
[EDIT] That doesn't work in Access. Next try.
You can create a query to do your filter. Let's call it "B_filtered". This is just
SELECT * FROM B where party_code = 2
(You could make the "2" a parameter to make it more flexible).
Then, just use this query in your actual query.
SELECT A.CategoryName,B_filtered.discount
from A LEFT JOIN B_filtered ON A.CategoryCode = B_filtered.CategoryCode
WHERE A.itemtype='F'
[EDIT]
Just Googled - I think you can do this directly with a subquery.
SELECT A.CategoryName,B_filtered.discount
from A LEFT JOIN (SELECT * FROM B where party_code = 2) AS B_filtered ON A.CategoryCode = B_filtered.CategoryCode
WHERE A.itemtype='F'
What mlinth proposed is correct, and would work in most other SQL dialects. The query below is the same basic concept but using a null condition.
Try:
SELECT A.CategoryName,B.discount
from A LEFT JOIN B ON A.CategoryCode = B.CategoryCode
WHERE A.itemtype='F' and (B.party_code=2 OR B.party_code IS NULL)
If party_code is nullable, do the IS NULL test on B's key or another non-nullable field instead.
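To see how the three variants above differ, here is a small sketch using Python's built-in sqlite3 as a stand-in for Access (an assumption; SQLite accepts the extra condition in the ON clause, which Access reportedly does not). The sample rows are made up, and the column is named PartyCode as in the table description rather than party_code as in the queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (CategoryName TEXT, CategoryCode INTEGER PRIMARY KEY, ItemType TEXT);
    CREATE TABLE B (CategoryCode INTEGER, Discount REAL, PartyCode INTEGER);
    -- Hypothetical sample data: Fruit has a party-2 discount, Fish only a
    -- party-7 discount, Fudge has no discount rows, Tools is not type 'F'.
    INSERT INTO A VALUES ('Fruit', 1, 'F'), ('Fish', 2, 'F'), ('Fudge', 3, 'F'), ('Tools', 4, 'T');
    INSERT INTO B VALUES (1, 10.0, 2), (2, 5.0, 7);
""")

def run(sql):
    return conn.execute(sql).fetchall()

# 1. Filter on B in WHERE: unmatched rows are dropped after the join.
where_filter = run("""
    SELECT A.CategoryName, B.Discount
    FROM A LEFT JOIN B ON A.CategoryCode = B.CategoryCode
    WHERE A.ItemType = 'F' AND B.PartyCode = 2
    ORDER BY A.CategoryCode
""")

# 2. Filter on B in ON: every 'F' row of A is kept.
on_filter = run("""
    SELECT A.CategoryName, B.Discount
    FROM A LEFT JOIN B ON A.CategoryCode = B.CategoryCode AND B.PartyCode = 2
    WHERE A.ItemType = 'F'
    ORDER BY A.CategoryCode
""")

# 3. WHERE with an IS NULL escape: keeps rows with no B match at all,
#    but still drops rows whose only B match belongs to another party.
null_check = run("""
    SELECT A.CategoryName, B.Discount
    FROM A LEFT JOIN B ON A.CategoryCode = B.CategoryCode
    WHERE A.ItemType = 'F' AND (B.PartyCode = 2 OR B.PartyCode IS NULL)
    ORDER BY A.CategoryCode
""")

print(where_filter)
print(on_filter)
print(null_check)
```

With this data, only the ON-clause filter (equivalent to the B_filtered subquery) returns all three 'F' categories; the IS NULL variant loses Fish because its only discount row belongs to a different party.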

Select multiple columns from different tables and join in Hive

I have a Hive table A with 5 columns; the first column (A.key) is the key, and I want to keep all 5 columns. I want to select 2 columns from B, say B.key1 and B.key2, and 2 columns from C, say C.key1 and C.key2. I want to join these tables on A.key = B.key1 and B.key2 = C.key1.
What I want is a new external table D that has the following columns. B.key2 and C.key2 values should be given NULL if no matching happened.
A.key, A_col1, A_col2, A_col3, A_col4, B.key2, C.key2
What is the correct Hive query for this? My initial attempt failed with a max split error.
Does this work?
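-- Hive CTAS generally cannot target an external table, so D is created as a managed table here
create table D as
-- B.key2 and C.key2 are aliased so the new table has no duplicate column names
select A.key, A.col1, A.col2, A.col3, A.col4, B.key2 as b_key2, C.key2 as c_key2
from A left outer join B on A.key = B.key1 left outer join C on B.key2 = C.key1;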
If not, could you post more info about the "max split error" you mentioned? Copy+paste specific error message text is good.
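As a quick check of the join logic described in the question (A.key = B.key1, then B.key2 = C.key1, with NULLs where nothing matches), here is a minimal sketch using Python's built-in sqlite3 in place of Hive. This is an assumption-laden illustration: the sample rows are invented, table A is cut down to two columns for brevity, and the aliases b_key2 and c_key2 are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
    -- "key" is quoted because it is a keyword in SQLite.
    CREATE TABLE A ("key" TEXT, col1 TEXT);
    CREATE TABLE B (key1 TEXT, key2 TEXT);
    CREATE TABLE C (key1 TEXT, key2 TEXT);
    -- Hypothetical data: a1 matches through to C, a2 matches only B,
    -- a3 matches nothing.
    INSERT INTO A VALUES ('a1', 'x'), ('a2', 'y'), ('a3', 'z');
    INSERT INTO B VALUES ('a1', 'b1'), ('a2', 'b2');
    INSERT INTO C VALUES ('b1', 'c1');
''')

# Chained left outer joins: every row of A is kept, and the B and C
# columns fall back to NULL when there is no match.
rows = conn.execute('''
    SELECT A."key", A.col1, B.key2 AS b_key2, C.key2 AS c_key2
    FROM A
    LEFT OUTER JOIN B ON A."key" = B.key1
    LEFT OUTER JOIN C ON B.key2 = C.key1
    ORDER BY A."key"
''').fetchall()

for row in rows:
    print(row)
```

All three rows of A appear in the result; the missing B and C values come back as None.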
