This question already has an answer here:
Closed 10 years ago.
Possible Duplicate:
Pig Script: Join with multiple files
I do a program based on hadoop.
Now, I have three file A,B,C. And I want to join them and following the condition "A.one = B.one and A.two = C.one";Then store the result to file D.
I know a little about pig, but its join can't content this command.
Actually it is easy in Pig as two step join:
A=LOAD ..
B=LOAD ..
C=LOAD ..
AB= JOIN A BY A.one,B BY B.One;
D= JOIN AB BY A::two, C BY C.one;
Related
I recently started using SAS, only receiving a basic training that didn't cover proc sql. I'd like to read up a bit more on SAS sql when I have the time.
For now, I found a solution to what I wanted to do, but I'm having difficulties understanding what is happening.
My issue started when I wanted to find out which subjects in my dataset have a certain value for all their records. I made use of my previously written snippet of code that I thought I understood. I just tried adding a couple more variables and group by statements:
data have;
input subject:$1. myvar:1. mycount:1.;
datalines;
a 1 1
a 0 2
a 0 3
b 1 1
b 0 2
b 1 3
c 1 1
c 1 2 /*This subject has myvar = 1 for all its observations*/
;
run;
*find subjects;
proc sql;
create table want as
/* select*/
/* distinct x.subject */
/* from */
(select distinct subject, count(myvar) as myvar_c
from have where myvar = 1 group by subject) x,
(select distinct subject, max(mycount) as max_c
from have group by subject) y
where x.subject = y.subject and x.myvar_c = y.max_c;
quit;
When removing the commented 'select distinct x.subject from' in the create table statement, the above code works as should.
However, I've previously also created another piece of code, to select all subjects in my dataset that have two types of records:
data have2;
input subject:$1. mytype:1.;
datalines;
a 1
a 0
a 0
b 1
b 0
b 1
c 1
c 1 /*This subject doesn't have two types of records in all its observations*/
;
run;
*Find subjects;
proc sql;
create table want2 as select
distinct x.subject from
have2 x,
(select distinct subject, count(distinct mytype) as mytype_c from have2 group by subject) y
where y.mytype_c = 2 and x.subject = y.subject;
quit;
Which is similar, but didn't require the additional select statement. The first code has 3 select statements, the second code only requires two select statements.
Can someone inform me why this is exactly required?
Or link me some good documentation that lists the specifications of these types of joins - can anyone also inform me of the specific name of this type of join where you only use a comma?
while I'm writing, also see that could've used my code I initially wrote to find subjects that have only 1 type of record and tweak it for my current issue >.< but still would like to know what is happening in the first example.
The SQL join construct
FROM ONE, TWO, THREE, …
is known as a CROSS JOIN and is a join without criteria. The comma (,) syntax is less prevalent today and the following construct is recommended
FROM ONE
CROSS JOIN TWO
CROSS JOIN THREE
The result set is a cartesian product and the number of rows is the product of the number of rows in the cross joined tables.
When the query has criteria (WHERE clause) the join is an INNER JOIN.
The SAS documentation for Proc SQL is a good starting point and includes examples.
joined-table Component
Joins a table with itself or with other tables or views.
…
Table of Contents
Syntax
Required Arguments
Optional Argument
Details
Types of Joins
Joining Tables
Table Limit
Specifying the Rows to Be Returned
Table Aliases
Joining a Table with Itself
Inner Joins
Outer Joins
Cross Joins
Union Joins
Natural Joins
Joining More Than Two Tables
Comparison of Joins and Subqueries
General tip:
If you want to fool around (fiddle) with SQL queries in a browser, try visiting
SQL Fiddle web site.
I have two huge tables in Hive. 'table 1' and 'table 2'. Both table has a common column 'key'.
I have queried 'table 1' with the desired conditions and created a DataFrame 'df1'.
Now, I want to query 'table 2' and want to use a column from 'df1' in the where clause.
Here is the code sample:
val df1 = hiveContext.sql("select * from table1 limit 100")
Can I do something like
val df2 = hiveContext.sql("select * from table2 where key = df1.key")
** Note : I don't want to make a single query with joining both tables
Any help will be appreciated.
Since you have explicitly written that you do NOT want to join the tables, then the short answer is "No, you cannot do such a query".
I'm not sure why you don't want to do the join, but it is definitely needed if you want to do the query. If you are worried about joining two "huge tables", then don't be. Spark was build for this kind of thing :)
The solution that I found is the following
Let me first give the dataset size.
Dataset1 - pretty small (10 GB)
Dataset2 - big (500 GB+)
There are two solutions to dataframe joins
Solution 1
If you are using Spark 1.6+, repartition both dataframes by the
column on which join has to be done. When I did it, the join was done
in less than 2 minutes.
df.repartition(df("key"))
Solution 2
If you are not using Spark 1.6+ (even if using 1.6+), if one
data is small, cache it and use that in broadcast
df_small.cache
df_big.join(broadcast(df_small) , "key"))
This was done in less than a minute.
Hi I have been trying to run this piece of pig script.
meddata = join meddata by (icddiag1 or icddiag2) left outer, codes by code;
with the intent of joining the meddata file if icddiag1 matches codes or if icddiag2 matches codes. I know I can join multiple times and do an unioin, but is there an easier way to do this?
I have a hive table A with 5 columns, the first column(A.key) is the key and I want to keep all 5 columns. I want to select 2 columns from B, say B.key1 and B.key2 and 2 columns from C, say C.key1 and C.key2. I want to join these columns with A.key = B.key1 and B.key2 = C.key1
What I want is a new external table D that has the following columns. B.key2 and C.key2 values should be given NULL if no matching happened.
A.key, A_col1, A_col2, A_col3, A_col4, B.key2, C.key2
What should be the correct hive query command? I got a max split error for my initial try.
Does this work?
create external table D as
select A.key, A.col1, A.col2, A.col3, A.col4, B.key2, C.key2
from A left outer join B on A.key = B.key1 left outer join C on A.key = C.key2;
If not, could you post more info about the "max split error" you mentioned? Copy+paste specific error message text is good.
I want to create a table in hive combining the columns of two tables.
So i want to create a single file in hdfs by including columns of both the files.
file1: a b c are the 3 columns
file2: x y z are the 3 columns
i want to create a file3: a b c x y z that has 6 columns.
How to do this ?
I tried many commands but it is appending the data into columns but i want all the columns in both the files to be present in a single file .
Thank you.
I think the simplest way will be to add id column to both tables (you need some column to do the join on) and then join tables on id column:
CREATE TABLE joined AS
SELECT first.id, first.a, first.b, first.c, second.x, second.y, second.z
FROM first JOIN second ON (first.id = second.id)