I have two datasets DS1 and DS2. DS1 is 100,000rows x 40cols, DS2 is 20,000rows x 20cols. I actually need to pull COL1 from DS1 if some fields match DS2.
Since I am very-very new to SAS, I am trying to stick to SQL logic.
So basically I did (shot version)
proc sql;
...
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
on DS1.COL2=DS2.COL3
OR DS1.COL3=DS2.COL3
OR DS1.COL4=DS2.COL2
...
After an hour or so, it was still running, but I was getting emails from SAS that I am using 700gb or so. Is there a better and faster SAS-way of doing this operation?
I would use 3 separate queries and use a UNION
proc sql;
...
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
on DS1.COL2=DS2.COL3
UNION
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
On DS1.COL3=DS2.COL3
UNION
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
ON DS1.COL4=DS2.COL2
...
You may have null or blank values in the columns you are joining on. Your query is probably matching all the null/blank columns together resulting in a very large result set.
I suggest adding additional clauses to exclude null results.
Also - if the same row happens to exist in both tables, then you should also prevent the row from joining to itself.
Either of these could effectively result in a cartesian product join (or something close to a cartesian product join).
EDIT : By the way - a good way of debugging this type of problem is to limit both datasets to a certain number of rows - say 100 in each - and then running it and checking the output to make sure it's expected. You can do this using the SQL options inobs=, outobs=, and loops=. Here's a link to the documentation.
First sort the datasets that you are trying to merge using proc sort. Then merge the datasets based on id.
Here is how you can do it.
I have assumed you match field as ID
proc sort data=DS1;
by ID;
proc sort data=DS2;
by ID;
data out;
merge DS1 DS2;
by ID;
run;
You can use proc sort for Ds3 and DS4 and then include them in merge statement if you need to join them as well.
Related
I am trying to join two datasets based on a flag and id.
i.e
proc sql;
create table demo as
select a.*,b.b1,b.2
from table1 a
left join table2 on
(a.flag=b.flag and a.id=b.id) or (a.flag ne b.flag and a.id=b.id)
end;
This code runs into a loop and never produces a output.
I want to make sure that where there are flag values matching get the attributes; if not get the attributes at id level so that we do not have blank values.
This join condition cannot be optimized. It is not a good practice to use or in a join. If you check your log, you'll see this:
NOTE: The execution of this query involves performing one or more Cartesian product joins
that can not be optimized.
Instead, transform your query to do a union:
proc sql;
create table demo as
select a.*,
b.b1,
b.b2
from table1 as a
left join
table2 as b
on a.flag=b.flag and a.id=b.id
UNION
select a.*,
b.b1,
b.b2
from table1 as a
left join
table2 as b
on a.flag ne b.flag and a.id=b.id
;
quit;
I recently started using SAS, only receiving a basic training that didn't cover proc sql. I'd like to read up a bit more on SAS sql when I have the time.
For now, I found a solution to what I wanted to do, but I'm having difficulties understanding what is happening.
My issue started when I wanted to find out which subjects in my dataset have a certain value for all their records. I made use of my previously written snippet of code that I thought I understood. I just tried adding a couple more variables and group by statements:
data have;
input subject:$1. myvar:1. mycount:1.;
datalines;
a 1 1
a 0 2
a 0 3
b 1 1
b 0 2
b 1 3
c 1 1
c 1 2 /*This subject has myvar = 1 for all its observations*/
;
run;
*find subjects;
proc sql;
create table want as
/* select*/
/* distinct x.subject */
/* from */
(select distinct subject, count(myvar) as myvar_c
from have where myvar = 1 group by subject) x,
(select distinct subject, max(mycount) as max_c
from have group by subject) y
where x.subject = y.subject and x.myvar_c = y.max_c;
quit;
When removing the commented 'select distinct x.subject from' in the create table statement, the above code works as should.
However, I've previously also created another piece of code, to select all subjects in my dataset that have two types of records:
data have2;
input subject:$1. mytype:1.;
datalines;
a 1
a 0
a 0
b 1
b 0
b 1
c 1
c 1 /*This subject doesn't have two types of records in all its observations*/
;
run;
*Find subjects;
proc sql;
create table want2 as select
distinct x.subject from
have2 x,
(select distinct subject, count(distinct mytype) as mytype_c from have2 group by subject) y
where y.mytype_c = 2 and x.subject = y.subject;
quit;
Which is similar, but didn't require the additional select statement. The first code has 3 select statements, the second code only requires two select statements.
Can someone inform me why this is exactly required?
Or link me some good documentation that lists the specifications of these types of joins - can anyone also inform me of the specific name of this type of join where you only use a comma?
while I'm writing, also see that could've used my code I initially wrote to find subjects that have only 1 type of record and tweak it for my current issue >.< but still would like to know what is happening in the first example.
The SQL join construct
FROM ONE, TWO, THREE, …
is known as a CROSS JOIN and is a join without criteria. The comma (,) syntax is less prevalent today and the following construct is recommended
FROM ONE
CROSS JOIN TWO
CROSS JOIN THREE
The result set is a cartesian product and the number of rows is the product of the number of rows in the cross joined tables.
When the query has criteria (WHERE clause) the join is an INNER JOIN.
The SAS documentation for Proc SQL is a good starting point and includes examples.
joined-table Component
Joins a table with itself or with other tables or views.
…
Table of Contents
Syntax
Required Arguments
Optional Argument
Details
Types of Joins
Joining Tables
Table Limit
Specifying the Rows to Be Returned
Table Aliases
Joining a Table with Itself
Inner Joins
Outer Joins
Cross Joins
Union Joins
Natural Joins
Joining More Than Two Tables
Comparison of Joins and Subqueries
General tip:
If you want to fool around (fiddle) with SQL queries in a browser, try visiting
SQL Fiddle web site.
According to Hive's documentation it supports NOT IN subqueries in a WHERE clause, provided that the subquery is an uncorrelated subquery (does not reference columns from the main query).
However, when I attempt to run the trivial query below, I get an error FAILED: SemanticException Cartesian products are disabled for safety reasons.
-- sample data
CREATE TEMPORARY TABLE foods (name STRING);
CREATE TEMPORARY TABLE vegetables (name STRING);
INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
-- the problematic query
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables)
Note that if I use an IN clause instead of a NOT IN clause, it actually works fine, which is perplexing because the query evaluation structure should be the same in either case.
Is there a workaround for this, or another way to filter values from a query based on their presence in another table?
This is Hive 2.3.4 btw, running on an Amazon EMR cluster.
Not sure why you would get that error. One work around is to use not exists.
SELECT f.*
FROM foods f
WHERE NOT EXISTS (SELECT 1
FROM vegetables v
WHERE v.name = f.name)
or a left join
SELECT f.*
FROM foods f
LEFT JOIN vegetables v ON v.name = f.name
WHERE v.name is NULL
You got cartesian join because this is what Hive does in this case. vegetables table is very small (just one row) and it is being broadcasted to perform the cross (most probably map-join, check the plan) join. Hive does cross (map) join first and then applies filter. Explicit left join syntax with filter as #VamsiPrabhala said will force to perform left join, but in this case it works the same, because the table is very small and CROSS JOIN does not multiply rows.
Execute EXPLAIN on your query and you will see what is exactly happening.
I've got multiple master tables in the same format with the same variables. I now want to left join another variable but I can't combine the master tables due to limited storage on my computer. Is there a way that I can left join a variable onto multiple master tables within one PROC SQL? Maybe with the help of a macro?
The LEFT JOIN code looks like this for one join but I'm looking for an alternative than to copy and paste this 5 times:
PROC SQL;
CREATE TABLE New AS
SELECT a.*, b.Value
FROM Old a LEFT JOIN Additional b
ON a.ID = b.ID;
QUIT;
You can't do it in one create table statement, as it only creates one table at a time. But you can do a few things, depending on what your actual limiting factor is (you mention a few).
If you simply want to avoid writing the same code five times, but otherwise don't care how it executes, then just write the code in a macro, as you reference.
%macro update_table(old=, new=);
PROC SQL;
CREATE TABLE &new. AS
SELECT a.*, b.Value
FROM &old. a LEFT JOIN Additional b
ON a.ID = b.ID;
QUIT;
%mend update_table;
%update_table(old=old1, new=new1)
%update_table(old=old2, new=new2)
%update_table(old=old3, new=new3)
Of course, if the names of the five tables are in a pattern, you can perhaps automate this further based on that pattern, but you don't give sufficient information to figure that out.
If you on the other hand need to do this more efficiently in terms of processing than running the SQL query five times, it can be done a number of ways, depending on the specifics of your additional table and your specific limitations. It looks to me that you have a good use case for a format lookup here, for example; see for example Jenine Eason's paper, Proc Format, a Speedy Alternative to Sort/Merge. If you're just merging on the ID, this is very easy.
data for_format;
set additional;
start = ID;
label = value;
fmtname='AdditionalF'; *or '$AdditionalF' if ID is character-valued;
output;
if _n_=1 then do; *creating an "other" option so it returns missing if not found;
hlo='o';
label = ' ';
output;
end;
run;
And then you just have five data steps with a PUT statement adding the value, or even you could simply format the ID variable with that format and it would have that value whenever you did most PROCs (if this is something like a classifier that you don't truly need "in" the data).
You can do this in a single pass through the data in a Data Step using a hash table to lookup values.
data new1 new2 new3;
set old1(in=a) old2(in=b) old3(in=c);
format value best.;
if _n_=1 then do;
%create_hash(lk,id,value,"Additional");
end;
value = .;
rc = lk.find();
drop rc;
if a then
output new1;
else if b then
output new2;
else if c then
output new3;
run;
%create_hash() macro available here.
You could, alternatively, use Joe's format with the same Data Step syntax.
I am writing a hive query to join two tables; table1 and table2. In the result I just need all columns from table1 and no columns from table2.
I know the solution where I can select all the columns manually by specifying table1.column1, table1.column2.. and so on in the select statement. But I have about 22 columns in table 1. Also, I have to do the same for multiple other tables ans its painful process.
I tried using "SELECT table1.*", but I get a parse exception.
Is there a better way to do it?
Hive 0.13 onwards the following query syntax works:
SELECT a.* FROM a JOIN b ON (a.id = b.id)
This query will select all columns from a. So instead of typing all the column names (making the query cumbersome), it is a better idea to use tablealias.*