How do I perform a left outer join using SPSS commands?

Can SPSS commands (e.g., MERGE FILES) be used to perform a left outer join between 2 SPSS datasets? Assume that the join field is not unique in either dataset.
Example:
Let the left dataset, Dataset1, contain 2 fields - ClassNbr and Fact1 - and these 4 records:
1 A
1 D
2 A
3 B
Let Dataset2 contain 2 fields - ClassNbr and Fact2 - and these 3 records:
1 XX
1 XY
3 ZZ
I want to join Dataset1 and Dataset2 on ClassNbr. The desired result is a 6-record dataset as follows:
1 A XX
1 A XY
1 D XX
1 D XY
2 A (NULL)
3 B ZZ
I would prefer a solution that uses SPSS commands (as opposed to SQL/Python/etc.).

As far as I'm aware, you cannot do this directly. One potential workaround is to reshape the data from long format to wide format (using casestovars), do the merge, and then reshape back into long format (using varstocases). Below is a worked example (if any clarification is needed on the code, just ask).
data list free / ClassNbr (F1) Fact1 (A1).
begin data
1 A
1 D
2 A
3 B
end data.
dataset name data1.
casestovars
/id = ClassNbr.
data list free / ClassNbr (F1) Fact2 (A2).
begin data
1 XX
1 XY
3 ZZ
end data.
dataset name data2.
casestovars
/id = ClassNbr.
match files file = 'data1'
/file = 'data2'
/by ClassNbr.
execute.
varstocases
/make Fact1 FROM Fact1.1 to Fact1.2
/null = KEEP.
varstocases
/make Fact2 FROM Fact2.1 to Fact2.2
/null = KEEP.
This creates some cases that you do not want; here I have just defined a set of commands to identify those cases and take them out (I'm sure this could be made more efficient).
*now cleaning up the extra records.
compute flag = 0.
if ClassNbr = lag(ClassNbr) and Fact1 = lag(Fact1) and Fact2 = lag(Fact2) flag = 1.
select if flag = 0.
execute.
if Fact1 = " " and Fact2 = " " flag = 1.
select if flag = 0.
execute.
if ClassNbr = lag(ClassNbr) and Fact1 = lag(Fact1) and Fact2 = " " flag = 1.
select if flag = 0.
execute.
if ClassNbr = lag(ClassNbr) and Fact2 = lag(Fact2) and Fact1 = " " flag = 1.
select if flag = 0.
execute.
I'm sure it would be possible to make this more robust (probably by writing some custom Python functions), but hopefully this helps get you started.
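If you do eventually drop into Python anyway, note that the desired 6-record result is just a standard left outer join on a non-unique key. As a cross-check only (not SPSS syntax; the dataframe names are made up), a minimal pandas sketch:
import pandas as pd

# Stand-ins for Dataset1 and Dataset2 from the question
df1 = pd.DataFrame({"ClassNbr": [1, 1, 2, 3], "Fact1": ["A", "D", "A", "B"]})
df2 = pd.DataFrame({"ClassNbr": [1, 1, 3], "Fact2": ["XX", "XY", "ZZ"]})

# A left outer join on the non-unique key yields the 6 rows shown in the question
result = df1.merge(df2, on="ClassNbr", how="left")
print(result)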

You can do this if you install the "STATS CARTPROD" extension bundle. With this extension you can create a cartesian product as an intermediate step to create an outer join.
Since SPSS 22 you can download it directly from the program menu Extra->Extension Bundles->Install and Download extension bundles. You can also download and install it manually from here: https://www.ibm.com/developerworks/community/files/app?lang=en#/file/d0afcd4e-6d5d-4779-84ef-2b68bc81b861
Note that you must have "Python Essentials for SPSS" installed in order for it to work.
*** create the example data.
DATA LIST FREE / classnbr1 (F1) fact1 (A1).
BEGIN DATA
1 A
1 D
2 A
3 B
END DATA.
DATASET NAME data1.
DATA LIST FREE / classnbr2 (F1) fact2 (A2).
BEGIN DATA
1 XX
1 XY
3 ZZ
END DATA.
DATASET NAME data2.
I ran into problems when using capital letters in variable names with the "STATS CARTPROD" extension. It is also important that the class number field has different variable names in the two datasets (here classnbr1 and classnbr2).
*** create cartesian product using the STATS CARTPROD extension.
DATASET ACTIVATE data1.
STATS CARTPROD INPUT2=data2
VAR1=classnbr1 fact1 VAR2=classnbr2 fact2
/SAVE OUTFILE="C:\MY FOLDER\cardprod.sav" DSNAME = cart.
EXECUTE.
*** create an equi join.
SELECT IF classnbr1 = classnbr2.
EXECUTE.
DELETE VARIABLES classnbr2.
Now add back the data1 cases which have no match in data2.
*** create left outer join
* assuming both data sets are ordered by classnbr1 and fact1
ADD FILES
/FILE = cart
/FILE = data1
/BY classnbr1 fact1.
EXECUTE.
DATASET NAME outer_join.
DATASET ACTIVATE outer_join.
COMPUTE select=1.
IF (length(fact2)=0 AND classnbr1=LAG(classnbr1) AND fact1=LAG(fact1)) select=0.
EXECUTE.
SELECT IF select = 1.
EXECUTE.
DELETE VARIABLES select.
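To summarize the logic of this answer outside SPSS: STATS CARTPROD builds the cartesian product, SELECT IF reduces it to the equi join, and ADD FILES puts the unmatched data1 cases back. A rough pandas sketch of that same sequence (illustration only; the names are made up):
import pandas as pd

df1 = pd.DataFrame({"classnbr1": [1, 1, 2, 3], "fact1": ["A", "D", "A", "B"]})
df2 = pd.DataFrame({"classnbr2": [1, 1, 3], "fact2": ["XX", "XY", "ZZ"]})

# Cartesian product (what STATS CARTPROD produces), then keep only the equi-join rows
cart = df1.merge(df2, how="cross")
equi = cart[cart["classnbr1"] == cart["classnbr2"]].drop(columns="classnbr2")

# Re-append the data1 cases without a partner, turning the equi join into a left outer join
no_match = df1[~df1["classnbr1"].isin(df2["classnbr2"])]
outer = pd.concat([equi, no_match], ignore_index=True).sort_values(["classnbr1", "fact1"])
print(outer)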
However, you might run into trouble with very big datasets, because the cartesian product will be huge.
To alleviate this a bit, you can drop all the cases which don't have a corresponding match in the respective other dataset before producing the cartesian product.
This is how it can be done:
*** create the example data.
*** (I added an additional case to the second data set, which will be deleted in the result, since it has no match in the first data set).
DATA LIST FREE / classnbr1 (F1) fact1 (A1).
BEGIN DATA
1 A
1 D
2 A
3 B
END DATA.
DATASET NAME data1.
DATA LIST FREE / classnbr2 (F1) fact2 (A2).
BEGIN DATA
1 XX
1 XY
3 ZZ
4 XY
END DATA.
DATASET NAME data2.
*** select cases which (don't) have a matching correspondent in the other dataset.
** Create a list of unique key values of data set data2
** (In this Example the key Value is classnbr2).
DATASET ACTIVATE data2.
DATASET COPY data2_keylist.
DATASET ACTIVATE data2_keylist.
* Assuming the data set is already sorted by the key value.
* Mark the first occurrence of every key value in the data set.
COMPUTE list = 1.
IF classnbr2 = LAG(classnbr2) list = 0.
SELECT IF list=1.
EXECUTE.
* Delete all variables except the (now unique) key value
MATCH FILES
/FILE *
/KEEP classnbr2.
EXECUTE.
** Match the list of data2 key values to data1 in order to mark
** which cases of data1 have at least one correspondent case in data 2.
DATASET ACTIVATE data1.
MATCH FILES
/FILE *
/TABLE data2_keylist
/RENAME classnbr2=classnbr1
/IN data2
/BY classnbr1.
EXECUTE.
** Remove cases from data1 which don't have a correspondent in data2
** and store them in another dataset, because we need to add them later.
DATASET COPY date1_nomatch.
SELECT IF data2=1.
EXECUTE.
DATASET ACTIVATE date1_nomatch.
SELECT IF data2=0.
EXECUTE.
** Now doing the same for the other data set.
** Create a list of unique key values of data set data1
** (In this Example the key Value is classnbr1).
DATASET ACTIVATE data1.
DATASET COPY data1_keylist.
DATASET ACTIVATE data1_keylist.
* Assuming the data set is already sorted by the key value.
* Mark the first occurrence of every key value in the data set.
COMPUTE list = 1.
IF classnbr1 = LAG(classnbr1) list = 0.
SELECT IF list=1.
EXECUTE.
* Delete all variables except the (now unique) key value
MATCH FILES
/FILE *
/KEEP classnbr1.
EXECUTE.
** Match the list of data1 key values to data2 in order to mark
** which cases of data2 have at least one correspondent case in data1.
DATASET ACTIVATE data2.
MATCH FILES
/FILE *
/TABLE data1_keylist
/RENAME classnbr1=classnbr2
/IN data1
/BY classnbr2.
EXECUTE.
** Remove cases from data2 which don't have a correspondent in data1.
SELECT IF data1=1.
EXECUTE.
*** create a cartesian product of the two reduced datasets.
DATASET ACTIVATE data1.
STATS CARTPROD INPUT2=data2
VAR1=classnbr1 fact1 VAR2=classnbr2 fact2
/SAVE OUTFILE="C:\MY FOLDER\cardprod.sav" DSNAME = outer_join.
EXECUTE.
*** create an equi join.
SELECT IF classnbr1 = classnbr2.
EXECUTE.
DELETE VARIABLES classnbr2.
*** create left outer join by adding the cases from date1_nomatch.
DATASET ACTIVATE outer_join.
ADD FILES
/FILE = *
/FILE = date1_nomatch
/BY classnbr1 fact1
/DROP data2.
EXECUTE.
* Some cleaning up.
DATASET CLOSE data1_keylist.
DATASET CLOSE date1_nomatch.
DATASET CLOSE data2_keylist.
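The pre-filtering step itself is independent of SPSS: keep only the keys that occur in both inputs before forming the product, and set the non-matching data1 cases aside to re-append after the equi join. A small pandas sketch of just that idea (placeholder names, mirroring the extended example with the unmatched key 4):
import pandas as pd

df1 = pd.DataFrame({"classnbr1": [1, 1, 2, 3], "fact1": ["A", "D", "A", "B"]})
df2 = pd.DataFrame({"classnbr2": [1, 1, 3, 4], "fact2": ["XX", "XY", "ZZ", "XY"]})

# Keys present in both inputs
shared = set(df1["classnbr1"]) & set(df2["classnbr2"])

df1_nomatch = df1[~df1["classnbr1"].isin(shared)]   # re-appended after the equi join
df1_small = df1[df1["classnbr1"].isin(shared)]      # goes into the cartesian product
df2_small = df2[df2["classnbr2"].isin(shared)]      # unmatched data2 cases are simply dropped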

Related

How to combine query results?

I have three queries that are tied together. The final output requires multiple loops over the queries. This way works just fine but seems very inefficient and too complex in my opinion. Here is what I have:
Query 1:
<cfquery name="qryTypes" datasource="#application.datasource#">
SELECT
t.type_id,
t.category_id,
c.category_name,
s.type_shortcode
FROM type t
INNER JOIN section s
ON s.type_id = t.type_id
INNER JOIN category c
ON c.category_id = t.category_id
WHERE t.rec_id = 45 -- This parameter is passed from form field.
ORDER BY s.type_name,c.category_name
</cfquery>
Query Types will produce this set of results:
4 11 SP PRES
4 12 CH PRES
4 13 MS PRES
4 14 XN PRES
Then loop over query Types and get the records from another query for each record that match:
Query 2:
<cfloop query="qryTypes">
<cfquery name="qryLocation" datasource=#application.datasource#>
SELECT l.location_id, l.spent_amount
FROM locations l
WHERE l.location_type = '#trim(category_name)#'
AND l.nofa_id = 45 -- This is form field
AND l.location_id = '#trim(category_id)##trim(type_id)#'
GROUP BY l.location_id,l.spent_amount
ORDER BY l.location_id ASC
</cfquery>
<cfset spent_total = arraySum(qryLocation['spent_amount']) />
<cfset amount_total = 0 />
<cfloop query="qryLocation">
<cfquery name="qryFunds" datasource=#application.datasource#>
SELECT sum(budget) AS budget
FROM funds f
WHERE f.location_id= '#qryLocation.location_id#'
AND nofa_id = 45
</cfquery>
<cfscript>
if (qryFunds.budget gt 0) {
amount_total = amount_total + qryFunds.budget;
}
</cfscript>
</cfloop>
<cfset GrandTotal = GrandTotal + spent_total />
<cfset GrandTotalad = GrandTotalad + amount_total />
</cfloop>
After the loops are completed this is result:
CATEGORY NAME SPENT TOTAL AMOUNT TOTAL
SP 970927 89613
CH 4804 8759
MS 9922 21436
XN 39398 4602
Grand Total: 1025051 124410
Is there a good way to merge this together and have only one query instead of three queries and inner loops? I was wondering if this might be a good fit for a stored procedure and then do all data manipulations in there? If anyone have suggestions please let me know.
qryTypes returns X records
qryLocation returns Y records
So far you've run (1 + X) queries.
qryFunds returns Z records
Now you've run 1 + X + (X * Y) queries.
The more data each returns, the more queries you'll run. Obviously not good.
If all you want is the final totals for each category, in a stored procedure, you could create a temp table with the joined data from qryTypes and qryLocation. Then your last qryFunds is just joined against that temp table data.
SELECT
sum(budget) AS budget
FROM
funds f
INNER JOIN
#TEMP_TABLE t ON t.location_id = f.location_id
AND
nofa_id = 45
You could then get other sums off the temp table if needed. It's possible this could all be worked into a single query, but maybe this helps you get there.
Also, a stored procedure can return multiple record sets, so you can have one return the aggregated table amount data and a second return the grand total. This would keep all the calculations in the database, with no need for CF to be involved.

SPSS Restructure Data

I have data in the following format:
ID Var1
1 a
1 a
1 b
1 b
2 c
2 c
2 c
I'd like to convert it (restructure it) to the following format in SPSS:
ID Var1_1 Var1_2 Var1_3 Total_Count
1 n(a)=2 n(b)=2 n(c)=0 4
2 n(a)=0 n(b)=0 n(c)=3 3
First I'll create some fake data to work with:
data list list/ID (f1) Var1 (a1).
begin data
1 a
1 a
1 b
1 b
2 c
2 c
2 c
3 b
3 c
3 c
3 c
end data.
dataset name ex.
Now you can run the following - aggregate, restructure, create the string with the counts:
aggregate outfile=* /break ID Var1/n=n.
sort cases by ID Var1.
casestovars /id=ID /index=var1.
recode a b c (miss=0).
string Var1_1 Var1_2 Var1_3 (a10).
do repeat abc=a b c/Var123=Var1_1 Var1_2 Var1_3/val="a" "b" "c".
compute Var123=concat("n(", val, ")=", ltrim(string(abc, f3))).
end repeat.
compute total_count=sum(a, b, c).
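Purely as an illustration of the target layout (not part of the SPSS solution), the same ID-by-value count table can be sketched in pandas with a crosstab:
import pandas as pd

# Toy data matching the example above
df = pd.DataFrame({"ID": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   "Var1": list("aabbcccbccc")})

# One row per ID, one column per Var1 value, cells are counts, plus a row total
counts = pd.crosstab(df["ID"], df["Var1"])
counts["Total_Count"] = counts.sum(axis=1)
print(counts)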
If you're doing this in SPSS Modeler, here is a stream image that works for this. The order is:
Create Data Set using User Input node, setting ID to integer and Var1 to string
Restructure by Var1 values to generate field Var1_a, Var1_b, and Var1_c
Aggregate using key field ID to sum counts Var1_a, Var1_b, and Var1_c, allowing Record Count field to be generated
Output to Table
[Stream image: Restructure and Aggregate in SPSS Modeler]
The transpose node comes in handy if you use version 18.1.
As it is a simple pivot, you can go to "Fields and Records", then place the ID in "Index", Var1 in "Fields" and see if you can add another field for Count aggregation. If not, just derive it.

SAS base programming

Is there a way to join/merge two datasets/tables in which one record in dataset B refers at the same time to a row (condition 1) and to a column (condition 2) of dataset A?
Condition 1: b.City = b.getColumnName() AND
Condition 2: b.Part_code = a.Part_code
What I am looking for would be something equivalent to the getColumnName(), to be able to make the comparison at the same time by row and by column.
Datasets are as follows (simplified examples):
Dataset A:
Part_code Miami LA
A_1 60000 38000
A_2 5000 2000
A_3 1000 60000
Dataset B:
Part_code City
A_1 Miami
Desired output (joined):
Part_code City Part_stock
A_1 Miami 60000
Thank you very much in advance!
What you are really looking to do is pivot the A data set and then filter it based on the cities in the B data set.
Proc Transpose to pivot the table:
proc sort data=a;
by part_code;
run;
proc transpose data=A out=A(rename=(_name_=city col1=part_stock));
by part_code;
run;
Then use an inner join to filter based on B
Proc sql noprint;
create table want as
select a.*
from A as a
inner join
B as b
on a.part_code = b.part_code
and a.city = b.city;
quit;
DomPazz's answer is the better solution because the parts table should be restructured to better handle lookups like this.
With that stated, here's a solution that uses the existing table structure. Note that both tables A and B must be sorted by Part_Code first.
data want;
merge
B (in=b)
A (in=a)
;
by Part_Code;
if a & b;
array invent(*) Miami--LA;
do i = 1 to dim(invent);
if vname(invent(i)) = City then do;
stock = invent(i);
output;
end;
end;
keep Part_Code City stock;
run;
One other option: VVALUEX will look up a column's value based on its name.
VVALUEX cannot be used in a SQL query though, if that matters.
data tableA;
infile cards truncover;
input Part_code $ Miami LA;
cards;
A_1 60000 38000
A_2 5000 2000
A_3 1000 60000
;
data tableB;
infile cards truncover;
input Part_code $ City $;
cards;
A_1 Miami
;
run;
proc sort data=tableA;
by part_code;
run;
proc sort data=tableB;
by part_code;
run;
data want;
merge tableB (in=B) tableA (in=A);
by part_code;
if B;
Value=input(vvaluex(City), best32.);
keep part_code city value;
run;

Skewed dataset join in Spark?

I am joining two big datasets using Spark RDD. One dataset is very skewed, so a few of the executor tasks are taking a long time to finish the job. How can I solve this?
Pretty good article on how it can be done: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Short version:
Add random element to large RDD and create new join key with it
Add random element to small RDD using explode/flatMap to increase number of entries and create new join key
Join RDDs on new join key which will now be distributed better due to random seeding
Say you have to join two tables A and B on A.id = B.id. Let's assume that table A has skew on id = 1.
i.e. select A.id from A join B on A.id = B.id
There are two basic approaches to solve the skew join issue:
Approach 1:
Break your query/dataset into 2 parts - one containing only the skewed data and the other containing the non-skewed data.
In the above example, the query will become:
1. select A.id from A join B on A.id = B.id where A.id <> 1;
2. select A.id from A join B on A.id = B.id where A.id = 1 and B.id = 1;
The first query will not have any skew, so all the tasks of ResultStage will finish at roughly the same time.
If we assume that B has only a few rows with B.id = 1, then they will fit into memory, so the second query will be converted to a broadcast join. This is also called a map-side join in Hive.
Reference: https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization
The partial results of the two queries can then be merged to get the final results.
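A rough PySpark sketch of Approach 1, assuming DataFrames A and B and a single skewed key id = 1 (the data here is made up):
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Toy data: key 1 is heavily over-represented in A
A = spark.createDataFrame([(1, "x")] * 1000 + [(2, "y"), (3, "z")], ["id", "a_val"])
B = spark.createDataFrame([(1, "p"), (2, "q"), (3, "r")], ["id", "b_val"])

# Part 1: the non-skewed keys join normally
part1 = A.where("id <> 1").join(B.where("id <> 1"), "id")

# Part 2: the skewed key joins against a broadcast of the small matching slice of B
part2 = A.where("id = 1").join(broadcast(B.where("id = 1")), "id")

# Merge the partial results
result = part1.unionByName(part2)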
Approach 2:
As also mentioned by LeMuBei above, the second approach tries to randomize the join key by appending an extra column.
Steps:
Add a column to the larger table (A), say skewLeft, and populate it with random numbers between 0 and N-1 for all rows.
Add a column to the smaller table (B), say skewRight. Replicate the smaller table N times, so the values in the new skewRight column vary from 0 to N-1 for each copy of the original data. For this, you can use the explode sql/dataset operator.
After 1 and 2, join the 2 datasets/tables with the join condition updated to:
A.id = B.id AND A.skewLeft = B.skewRight
Reference: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Depending on the particular kind of skew you're experiencing, there may be different ways to solve it. The basic idea is:
Modify your join column, or create a new join column, that is not skewed but which still retains adequate information to do the join
Do the join on that non-skewed column -- resulting partitions will not be skewed
Following the join, you can update the join column back to your preferred format, or drop it if you created a new column
The "Fighting the Skew In Spark" article referenced in LiMuBei's answer is a good technique if the skewed data participates in the join. In my case, skew was caused by a very large number of null values in the join column. The null values were not participating in the join, but since Spark partitions on the join column, the post-join partitions were very skewed as there was one gigantic partition containing all of the nulls.
I solved it by adding a new column which changed all null values to a well-distributed temporary value, such as "NULL_VALUE_X", where X is replaced by random numbers between say 1 and 10,000, e.g. (in Java):
// Before the join, create a join column with well-distributed temporary values for null swids. This column
// will be dropped after the join. We need to do this so the post-join partitions will be well-distributed,
// and not have a giant partition with all null swids.
String swidWithDistributedNulls = "swid_with_distributed_nulls";
int numNullValues = 10000; // Just use a number that will always be bigger than number of partitions
Column swidWithDistributedNullsCol =
when(csDataset.col(CS_COL_SWID).isNull(), functions.concat(
functions.lit("NULL_SWID_"),
functions.round(functions.rand().multiply(numNullValues)))
)
.otherwise(csDataset.col(CS_COL_SWID));
csDataset = csDataset.withColumn(swidWithDistributedNulls, swidWithDistributedNullsCol);
Then joining on this new column, and then after the join:
outputDataset = outputDataset.drop(swidWithDistributedNulls);
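Roughly the same null-salting idea in PySpark, in case that is easier to adapt (the tiny dataframe and the swid column are stand-ins for the real data):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for the real dataset; "swid" is the join column with many nulls
cs = spark.createDataFrame([("abc", 1), (None, 2), (None, 3)], ["swid", "v"])

num_null_values = 10000  # just needs to exceed the number of partitions
cs = cs.withColumn(
    "swid_with_distributed_nulls",
    F.when(F.col("swid").isNull(),
           F.concat(F.lit("NULL_SWID_"),
                    F.round(F.rand() * num_null_values).cast("int").cast("string")))
     .otherwise(F.col("swid")))
# join on "swid_with_distributed_nulls" instead of "swid", then drop it afterwards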
Taking reference from https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
below is the code for fighting the skew in spark using Pyspark dataframe API
Creating the 2 dataframes:
from math import exp
from random import randint
from datetime import datetime
def count_elements(splitIndex, iterator):
n = sum(1 for _ in iterator)
yield (splitIndex, n)
def get_part_index(splitIndex, iterator):
for it in iterator:
yield (splitIndex, it)
num_parts = 18
# create the large skewed rdd
skewed_large_rdd = sc.parallelize(range(0,num_parts), num_parts).flatMap(lambda x: range(0, int(exp(x))))
skewed_large_rdd = skewed_large_rdd.mapPartitionsWithIndex(lambda ind, x: get_part_index(ind, x))
skewed_large_df = spark.createDataFrame(skewed_large_rdd,['x','y'])
small_rdd = sc.parallelize(range(0,num_parts), num_parts).map(lambda x: (x, x))
small_df = spark.createDataFrame(small_rdd,['a','b'])
Dividing the data into 100 bins for large df and replicating the small df 100 times
salt_bins = 100
from pyspark.sql import functions as F
skewed_transformed_df = skewed_large_df.withColumn('salt', (F.rand()*salt_bins).cast('int')).cache()
small_transformed_df = small_df.withColumn('replicate', F.array([F.lit(i) for i in range(salt_bins)]))
small_transformed_df = small_transformed_df.select('*', F.explode('replicate').alias('salt')).drop('replicate').cache()
Finally the join avoiding the skew
t0 = datetime.now()
result2 = skewed_transformed_df.join(small_transformed_df, (skewed_transformed_df['x'] == small_transformed_df['a']) & (skewed_transformed_df['salt'] == small_transformed_df['salt']) )
result2.count()
print "The direct join takes %s"%(str(datetime.now() - t0))
Apache DataFu has two methods for doing skewed joins that implement some of the suggestions in the previous answers.
The joinSkewed method does salting (adding a random number column to split the skewed values).
The broadcastJoinSkewed method is for when you can divide the dataframe into skewed and regular parts, as described in Approach 2 from the answer by moriarty007.
These methods in DataFu are useful for projects using Spark 2.x. If you are already on Spark 3, adaptive query execution can handle skewed joins out of the box.
Full disclosure - I am a member of Apache DataFu.
You could try to repartition the "skewed" RDD across more partitions, or try to increase spark.sql.shuffle.partitions (which defaults to 200).
In your case, I would try to set the number of partitions much higher than the number of executors.
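For reference, both knobs look roughly like this in code (2000 is only an example value, and the dataframe and column names are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Raise the number of shuffle partitions (the default is 200)
spark.conf.set("spark.sql.shuffle.partitions", "2000")

# Or repartition the skewed dataframe explicitly on the join key before joining
# skewed_df = skewed_df.repartition(2000, "id")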

How to check if join was 1-1 or 1-many in SAS?

Is there an efficient way in SAS to verify whether a join you ran was 1-to-1 or 1-to-many? I often work with tables that do not have a clear unique identifier, which has led me to run 1-to-many joins thinking they were 1-to-1, thus messing up my analysis.
In the simple case where I'm expecting the input datasets for a merge to be unique by some key, I will often code a simple assertion into the merge that throws an error if any duplicates are found:
Sample data:
data one;
do id=1,2,3;
output;
end;
run;
data two;
do id=1,2,2,3,4,4;
output;
end;
run;
Log:
16 data want;
17 merge one two;
18 by id;
19 if not (first.id and last.id) then put "ERROR: duplicates!" id=;
20 run;
ERROR: duplicates!id=2
ERROR: duplicates!id=2
ERROR: duplicates!id=4
ERROR: duplicates!id=4
NOTE: There were 3 observations read from the data set WORK.ONE.
NOTE: There were 6 observations read from the data set WORK.TWO.
NOTE: The data set WORK.WANT has 6 observations and 1 variables
That doesn't tell you which dataset has duplicates (for that you need to use in= variables like Tom's answer), but it's an easy safety net to catch duplicates.
You can also just check your output dataset for duplicates after the merge, e.g.
data _null_;
set want (keep=id);
by id;
if not (first.id and last.id) then put "ERROR: Duplicate ! " id=;
run;
Duplicates are dangerous.
You can use the IN= flags, but you need to clear them.
Let's make some sample datasets.
data one;
do id=1,2,2,3;
output;
end;
run;
data two;
do id=1,1,2,2,3,3;
output;
end;
run;
Now merge them by ID. Clear the IN= variables before the MERGE statement so that the flag is not carried forward on the dataset with just a single observation.
data want ;
call missing(in1,in2);
merge one(in=in1) two (in=in2);
by id;
if not first.id and sum(of in1-in2)> 1 then put 'Multiple Merge: ' (_n_ id in1 in2) (=);
run;
Results in the LOG.
Multiple Merge: _N_=4 id=2 in1=1 in2=1
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 4 observations read from the data set WORK.ONE.
NOTE: There were 6 observations read from the data set WORK.TWO.
NOTE: The data set WORK.WANT has 6 observations and 1 variables.
Checking before merging is a better idea... Here are two nice and easy ways to do it. (Suppose we have a dataset named one with a column id to be used for the merge.)
Identify duplicate id's with PROC FREQ
proc freq data = one noprint;
table id /out = freqs_id_one(where=(count>1));
run;
Sort dataset using nodupkey
...redirecting duplicate id's in a distinct dataset:
proc sort data=one nodupkey out=one_nodupids dupout=one_dupids;
by id;
run;
Checking after-the-fact
If you realize too late that you didn't check for dupes (doh!), you can obtain the frequencies of the id with PROC FREQ (same code as above) or with a PROC SQL query:
proc sql;
select id,
count(id) as count
from merged_dataset
group by id
having count > 1;
quit;
