I am writing a hive query to join two tables; table1 and table2. In the result I just need all columns from table1 and no columns from table2.
I know I can select all the columns manually by specifying table1.column1, table1.column2, and so on in the SELECT statement, but table1 has about 22 columns. I also have to do the same for several other tables, and it's a painful process.
I tried using "SELECT table1.*", but I get a parse exception.
Is there a better way to do it?
From Hive 0.13 onwards, the following query syntax works:
SELECT a.* FROM a JOIN b ON (a.id = b.id)
This query selects all columns from a. Instead of typing out every column name (which makes the query cumbersome), use tablealias.*
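To see what the alias.* form returns, here is a minimal sketch that runs the same query shape in SQLite through Python's sqlite3 module. SQLite is only a stand-in for Hive here, and the column names (name, extra) and sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, extra TEXT);
    INSERT INTO a VALUES (1, 'x'), (2, 'y');
    INSERT INTO b VALUES (1, 'p'), (3, 'q');
""")

# a.* expands to the columns of a only; b is used just for matching.
cur = conn.execute("SELECT a.* FROM a JOIN b ON (a.id = b.id)")
columns = [d[0] for d in cur.description]
rows = cur.fetchall()

print(columns)  # ['id', 'name']
print(rows)     # [(1, 'x')]
```

Only the columns of a come back, and only for ids that also exist in b.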
I'm trying to join two tables using PROC SQL. One of the columns I'm using for the join has a space in its name. The query I'm using:
PROC SQL;
CREATE TABLE TEST AS
SELECT a.*, b.*
FROM TABLE_1 a
INNER JOIN TABLE_2 b
ON a.CONTNO = b."Contract Number";
QUIT;
This is the error I'm getting:
ERROR 22-322: Syntax error, expecting one of the following: a name, *.
How do I fix this?
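You need to write the column name as a SAS name literal: put it in quotes and add a trailing n. Square brackets are SQL Server and Access syntax and are not valid in PROC SQL. For example:
b.'Contract Number'n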
Tip: table aliases (a, b) are only shorthand and have no runtime cost, so use whichever form reads more clearly. With a single joined table, writing out the table names is fine too.
I am trying to join two datasets based on a flag and id.
i.e
proc sql;
create table demo as
select a.*,b.b1,b.b2
from table1 a
left join table2 b on
(a.flag=b.flag and a.id=b.id) or (a.flag ne b.flag and a.id=b.id);
quit;
This code seems to run forever and never produces any output.
I want to make sure that where the flag values match I get the attributes at that level; if not, I get the attributes at id level so that we do not have blank values.
This join condition cannot be optimized. It is not good practice to use OR in a join condition. If you check your log, you'll see this:
NOTE: The execution of this query involves performing one or more Cartesian product joins
that can not be optimized.
Instead, transform your query to do a union:
proc sql;
create table demo as
select a.*,
b.b1,
b.b2
from table1 as a
left join
table2 as b
on a.flag=b.flag and a.id=b.id
UNION
select a.*,
b.b1,
b.b2
from table1 as a
left join
table2 as b
on a.flag ne b.flag and a.id=b.id
;
quit;
I am joining tbl_A to tbl_B, on column CustomerID in tbl_A to column Output in tbl_B which contains customer ID. However, tbl_B has all other information in related rows that I do not want to lose when joining. I tried to join using like, but I lost rows that did not contain customer ID in the output column.
Here is my join query in Hive:
select a.*, b.Output from tbl_A a
left join tbl_B b
On b.Output like concat('%', a.CustomerID, '%')
However, I lose other rows from output.
You could also achieve the objective by a simple hive query like this :)
select a.*, b.Output
from tbl_A a, tbl_B b
where b.Output like concat('%', a.CustomerID, '%')
I would suggest first extracting all the IDs from the free-text field (the Output column of tbl_B in your case) into a separate table. Then join that ID table back to tbl_B so that every row of tbl_B carries its ID, and finally join this ID-tagged version of tbl_B to tbl_A.
Hope this helps.
I have two datasets DS1 and DS2. DS1 is 100,000rows x 40cols, DS2 is 20,000rows x 20cols. I actually need to pull COL1 from DS1 if some fields match DS2.
Since I am very-very new to SAS, I am trying to stick to SQL logic.
So basically I did this (short version):
proc sql;
...
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
on DS1.COL2=DS2.COL3
OR DS1.COL3=DS2.COL3
OR DS1.COL4=DS2.COL2
...
After an hour or so it was still running, and I was getting emails from SAS saying that I was using 700 GB or so. Is there a better and faster SAS way of doing this operation?
I would use three separate queries combined with UNION:
proc sql;
...
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
on DS1.COL2=DS2.COL3
UNION
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
On DS1.COL3=DS2.COL3
UNION
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
ON DS1.COL4=DS2.COL2
...
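As a quick sanity check of the rewrite above, here is a small sketch in Python's sqlite3 (SQLite standing in for SAS, with made-up sample rows) that compares the OR join with the three-way UNION. One thing it makes visible: UNION removes duplicates, so it returns the distinct col1 values, while the OR join repeats a row once per matching DS2 row.

```python
import sqlite3

# SQLite stands in for SAS PROC SQL here; the data is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DS1 (col1 TEXT, col2 INTEGER, col3 INTEGER, col4 INTEGER);
    CREATE TABLE DS2 (col2 INTEGER, col3 INTEGER);
    INSERT INTO DS1 VALUES ('r1', 10, 0, 0), ('r2', 0, 20, 0),
                           ('r3', 0, 0, 30), ('r4', 1, 2, 3);
    INSERT INTO DS2 VALUES (30, 10), (99, 20), (98, 10);
""")

# Original form: one join with three OR-ed conditions.
or_rows = conn.execute("""
    SELECT DS1.col1 FROM DS1 INNER JOIN DS2
      ON DS1.col2 = DS2.col3 OR DS1.col3 = DS2.col3 OR DS1.col4 = DS2.col2
""").fetchall()

# Rewritten form: one simple equi-join per condition, combined with UNION.
union_rows = conn.execute("""
    SELECT DS1.col1 FROM DS1 INNER JOIN DS2 ON DS1.col2 = DS2.col3
    UNION
    SELECT DS1.col1 FROM DS1 INNER JOIN DS2 ON DS1.col3 = DS2.col3
    UNION
    SELECT DS1.col1 FROM DS1 INNER JOIN DS2 ON DS1.col4 = DS2.col2
""").fetchall()

print(sorted(or_rows))     # 'r1' appears twice (two DS2 rows match it)
print(sorted(union_rows))  # each matching col1 appears once
```

If you need the duplicates, UNION ALL keeps them, but then a DS1 row that satisfies more than one condition against the same DS2 row is counted once per condition.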
You may have null or blank values in the columns you are joining on. Your query is probably matching all the null/blank columns together resulting in a very large result set.
I suggest adding additional clauses to exclude null results.
Also - if the same row happens to exist in both tables, then you should also prevent the row from joining to itself.
Either of these could effectively result in a cartesian product join (or something close to a cartesian product join).
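To get a feel for how quickly blank keys blow up a join, here is a small sketch using Python's sqlite3 with invented data: 100 rows with a blank key on one side and 50 on the other. It uses empty strings rather than NULLs because SQLite never treats NULL keys as equal, whereas SAS (as far as I know) compares missing values as equal, so in SAS missing keys would behave like the blanks shown here.

```python
import sqlite3

# SQLite stand-in for SAS; table contents are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DS1 (col1 TEXT, col2 TEXT)")
conn.execute("CREATE TABLE DS2 (col3 TEXT)")

# 100 DS1 rows with a blank key plus one row with a real key.
conn.executemany(
    "INSERT INTO DS1 VALUES (?, ?)",
    [(f"row{i}", "") for i in range(100)] + [("keyed", "K1")],
)
# 50 DS2 rows with a blank key plus one row with a real key.
conn.executemany("INSERT INTO DS2 VALUES (?)", [("",)] * 50 + [("K1",)])

join_sql = "SELECT COUNT(*) FROM DS1 INNER JOIN DS2 ON DS1.col2 = DS2.col3"

# Every blank key matches every other blank key: 100 * 50 rows, plus the real match.
unfiltered = conn.execute(join_sql).fetchone()[0]

# Excluding blank keys leaves only the genuine match.
filtered = conn.execute(join_sql + " WHERE DS1.col2 <> ''").fetchone()[0]

print(unfiltered, filtered)  # 5001 1
```

With 100,000 and 20,000 rows, even a modest share of blank keys multiplies into a very large intermediate result.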
EDIT: By the way, a good way of debugging this type of problem is to limit both datasets to a certain number of rows (say 100 in each), run the query, and check that the output is what you expect. You can do this using the PROC SQL options inobs=, outobs=, and loops=, which are described in the PROC SQL documentation.
First sort the datasets that you are trying to merge using proc sort. Then merge the datasets based on id.
Here is how you can do it.
I have assumed your match field is ID.
proc sort data=DS1;
by ID;
proc sort data=DS2;
by ID;
data out;
merge DS1 DS2;
by ID;
run;
You can run proc sort on DS3 and DS4 as well and then include them in the merge statement if you need to join them too.
I have a hive table A with 5 columns, the first column(A.key) is the key and I want to keep all 5 columns. I want to select 2 columns from B, say B.key1 and B.key2 and 2 columns from C, say C.key1 and C.key2. I want to join these columns with A.key = B.key1 and B.key2 = C.key1
What I want is a new external table D that has the following columns. B.key2 and C.key2 values should be given NULL if no matching happened.
A.key, A_col1, A_col2, A_col3, A_col4, B.key2, C.key2
What should be the correct hive query command? I got a max split error for my initial try.
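Does this work? It joins C on B.key2 = C.key1 as you described and aliases the two key2 columns so their names do not clash. Hive's CREATE TABLE AS SELECT cannot create an external table, so this creates a managed table instead.
create table D as
select A.key, A.col1, A.col2, A.col3, A.col4, B.key2 as b_key2, C.key2 as c_key2
from A left outer join B on A.key = B.key1 left outer join C on B.key2 = C.key1;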
If not, could you post more info about the "max split error" you mentioned? Copying and pasting the exact error message text would help.
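For reference, here is a small sketch of how chained left outer joins fill in NULLs when a match is missing, run in SQLite through Python's sqlite3 rather than Hive. It uses the join conditions from the question (A.key = B.key1 and B.key2 = C.key1); the table contents, the reduced column list, and the aliases b_key2 and c_key2 are assumptions for illustration.

```python
import sqlite3

# SQLite stand-in for Hive; only two of A's five columns are modelled here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (key INTEGER, col1 TEXT);
    CREATE TABLE B (key1 INTEGER, key2 INTEGER);
    CREATE TABLE C (key1 INTEGER, key2 TEXT);
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO B VALUES (1, 100), (2, 200);
    INSERT INTO C VALUES (100, 'c100');
""")

# Every row of A is kept; B and C columns are NULL where no match exists.
rows = conn.execute("""
    SELECT A.key, A.col1, B.key2 AS b_key2, C.key2 AS c_key2
    FROM A
    LEFT OUTER JOIN B ON A.key = B.key1
    LEFT OUTER JOIN C ON B.key2 = C.key1
    ORDER BY A.key
""").fetchall()

for row in rows:
    print(row)
# (1, 'a1', 100, 'c100')
# (2, 'a2', 200, None)
# (3, 'a3', None, None)
```

Key 2 matches B but not C, and key 3 matches neither, so the missing values come back as NULL (None in Python).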