Is there a way to join/merge two datasets/tables in which one record in dataset B refers at the same time to a column (condition 1) and to a row (condition 2) of dataset A?:
Condition 1: b.City = a.getColumnName() AND
Condition 2: b.Part_code = a.Part_code
What I am looking for would be something equivalent to getColumnName(), so I can make the comparison by row and by column at the same time.
Datasets are as follows (simplified examples):
Dataset A:
Part_code Miami LA
A_1 60000 38000
A_2 5000 2000
A_3 1000 60000
Dataset B:
Part_code City
A_1 Miami
Desired output (joined):
Part_code City Part_stock
A_1 Miami 60000
Thank you very much in advance!
What you are really looking to do is pivot the A data set and then filter it based on the cities in the B data set.
Proc Transpose to pivot the table:
proc sort data=a;
by part_code;
run;
proc transpose data=A out=A(rename=(_name_=city col1=part_stock));
by part_code;
run;
Then use an inner join to filter based on B
Proc sql noprint;
create table want as
select a.*
from A as a
inner join
B as b
on a.part_code = b.part_code
and a.city = b.city;
quit;
DomPazz's answer is the better solution because the parts table should be restructured to better handle lookups like this.
With that stated, here's a solution that uses the existing table structure. Note that both tables A and B must be sorted by Part_Code first.
data want;
merge
B (in=b)
A (in=a)
;
by Part_Code;
if a & b;
array invent(*) Miami--LA;
do i = 1 to dim(invent);
if vname(invent(i)) = City then do;
stock = invent(i);
output;
end;
end;
keep Part_Code City stock;
run;
One other option: VVALUEX will look up a column's value based on its name.
VVALUEX cannot be used in a SQL query though, if that matters.
data tableA;
infile cards truncover;
input Part_code $ Miami LA;
cards;
A_1 60000 38000
A_2 5000 2000
A_3 1000 60000
;
data tableB;
infile cards truncover;
input Part_code $ City $;
cards;
A_1 Miami
;
run;
proc sort data=tableA;
by part_code;
run;
proc sort data=tableB;
by part_code;
run;
data want;
merge tableB (in=B) tableA (in=A);
by part_code;
if B;
Value=input(vvaluex(City), best32.);
keep part_code city value;
run;
In column A I have the ID of the home team, in B the name of the home team, in C the ID of the visiting team, and in D the name of the visiting team:
12345 Borac Banja Luka 98765 B36
678910 Panevezys 43214 Milsami
1112131415 Flora 7852564 SJK
1617181920 Magpies 874236551 Dila
I want to create a column of ids and another of names but keeping the sequence of who will play with whom:
12345 Borac Banja Luka
98765 B36
678910 Panevezys
43214 Milsami
1112131415 Flora
7852564 SJK
1617181920 Magpies
874236551 Dila
Currently (this works) I'm joining the columns with a special character, using FLATTEN and finally SPLIT:
=ARRAYFORMULA(SPLIT(FLATTEN({
FILTER(A1:A&"§§§§§"&B1:B,(A1:A<>"")*(B1:B<>"")),
FILTER(C1:C&"§§§§§"&D1:D,(C1:C<>"")*(D1:D<>""))
}),"§§§§§"))
Is there a less archaic, more correct approach for this type of case?
Spreadsheet for testing
889 A 5687 C
532 B 8723 D
Stack up the columns using {} and SORT them by a SEQUENCE of 1,2,1,2:
=SORT({A1:B2;C1:D2},{SEQUENCE(ROWS(A1:B2));SEQUENCE(ROWS(A1:B2))},1)
889 A
5687 C
532 B
8723 D
You can also try the QUERY function; enter this formula in F1:
={QUERY((A1:B), "SELECT * WHERE A IS NOT NULL and B IS NOT NULL",1);
QUERY((C1:D), "SELECT * WHERE C IS NOT NULL and D IS NOT NULL",1)}
I am trying to do some calculations based on joined data sets.
My aim is to calculate the revenue in prices of the previous year.
The code below works for the revenue with current prices and sales.
data work.price;
input date date. car $ price;
format date date9. ;
datalines;
01Jan19 Model1 7000
01Jan19 Model2 4000
01Jan19 Model3 5000
01Jan20 Model1 7500
01Jan20 Model2 4800
01Jan20 Model3 4500
01Jan21 Model1 8000
01Jan21 Model2 5200
01Jan21 Model3 4000
run;
data work.sales;
input date date. type $ sales;
format date date9. ;
datalines;
01Jan19 A 10
01Jan19 B 4
01Jan19 C 50
01Jan20 A 18
01Jan20 B 10
01Jan20 C 16
01Jan21 A 22
01Jan21 B 8
01Jan21 C 13
run;
data work.assignment;
input car $6. type $7.;
datalines;
Model1 A
Model2 B
Model3 C
run;
proc sql ;
create table want as
select Date format date9., *,price*sales as return
from sales
natural join price
natural join assignment
;
quit;
My solution so far was to shift the time series of the prices prior to joining.
But I wonder if this step can be done more efficiently in the proc sql statement.
data work.price;
set work.price;
Date = intnx('month',date,+12);
run;
Thanks a lot for your help!
You can do that by using an inner join and specifying the join conditions. For you this would be something like
proc sql;
create table want as
select
p.date format date9., s.type, s.sales, p.car, p.price, p.price * s.sales as return
from
sales as s
inner join
price as p
on
intnx("year", s.date, -1) = p.date
inner join
assignment as a
on
s.type = a.type and
p.car = a.car;
quit;
Personally, I always write explicit inner joins and do not rely on natural joins: if something is wrong with your input data, you get an error instead of a wrong result. It also forces you to select the needed columns explicitly instead of just selecting *.
I am joining two big datasets using Spark RDD. One dataset is heavily skewed, so a few of the executor tasks are taking a long time to finish the job. How can I solve this?
Pretty good article on how it can be done: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Short version:
Add a random element to the large RDD and create a new join key with it
Add a random element to the small RDD using explode/flatMap to increase the number of entries, and create a new join key
Join the RDDs on the new join key, which will now be distributed better due to the random seeding (see the sketch below)
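To make those steps concrete, here is a minimal PySpark RDD sketch of the idea (not code from the linked article). It assumes large_rdd and small_rdd are pair RDDs of (key, value), and a salting factor N of 10 is an arbitrary assumption to tune against the actual skew.
import random
N = 10  # assumed salting factor; tune to the degree of skew
# Large, skewed pair RDD: append a random salt to each key.
salted_large = large_rdd.map(
    lambda kv: ((kv[0], random.randint(0, N - 1)), kv[1])
)
# Small pair RDD: replicate each record once per possible salt value (flatMap plays the "explode" role).
salted_small = small_rdd.flatMap(
    lambda kv: [((kv[0], i), kv[1]) for i in range(N)]
)
# Join on the composite (key, salt) key, then strip the salt back off.
joined = salted_large.join(salted_small).map(lambda kv: (kv[0][0], kv[1]))
Because each hot key is now split across N composite keys, its records are shuffled to N different partitions instead of one.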
Say you have to join two tables A and B on A.id = B.id. Let's assume that table A has skew on id = 1.
i.e. select A.id from A join B on A.id = B.id
There are two basic approaches to solve the skew join issue:
Approach 1:
Break your query/dataset into two parts: one containing only the skewed data and the other containing the non-skewed data.
In the above example, the query will become:
1. select A.id from A join B on A.id = B.id where A.id <> 1;
2. select A.id from A join B on A.id = B.id where A.id = 1 and B.id = 1;
The first query will not have any skew, so all the tasks of ResultStage will finish at roughly the same time.
If we assume that B has only a few rows with B.id = 1, then that subset will fit into memory, so the second query will be converted to a broadcast join. This is also called a map-side join in Hive.
Reference: https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization
The partial results of the two queries can then be merged to get the final results.
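As a hedged PySpark DataFrame illustration of this split-and-union approach (assuming A and B are already loaded as DataFrames, the join column is id, and the skew sits on id = 1 as in the example above):
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast
# Query 1: all non-skewed keys join normally.
non_skew = A.filter(F.col("id") != 1).join(B, on="id")
# Query 2: the skewed key joins against the small matching slice of B,
# which Spark can broadcast to every executor (a map-side join).
skew = A.filter(F.col("id") == 1).join(
    broadcast(B.filter(F.col("id") == 1)), on="id"
)
# Merge the partial results.
result = non_skew.unionByName(skew)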
Approach 2:
Also mentioned by LiMuBei above, the second approach tries to randomize the join key by appending an extra column.
Steps:
Add a column to the larger table (A), say skewLeft, and populate it with random numbers between 0 and N-1 for all rows.
Add a column to the smaller table (B), say skewRight. Replicate the smaller table N times, so the values in the new skewRight column vary from 0 to N-1 for each copy of the original data. For this, you can use the explode SQL/Dataset operator.
After steps 1 and 2, join the two datasets/tables with the join condition updated to:
A.id = B.id && A.skewLeft = B.skewRight
Reference: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Depending on the particular kind of skew you're experiencing, there may be different ways to solve it. The basic idea is:
Modify your join column, or create a new join column, that is not skewed but which still retains adequate information to do the join
Do the join on that non-skewed column -- resulting partitions will not be skewed
Following the join, you can update the join column back to your preferred format, or drop it if you created a new column
The "Fighting the Skew In Spark" article referenced in LiMuBei's answer is a good technique if the skewed data participates in the join. In my case, skew was caused by a very large number of null values in the join column. The null values were not participating in the join, but since Spark partitions on the join column, the post-join partitions were very skewed as there was one gigantic partition containing all of the nulls.
I solved it by adding a new column which changed all null values to a well-distributed temporary value, such as "NULL_VALUE_X", where X is replaced by random numbers between say 1 and 10,000, e.g. (in Java):
// Before the join, create a join column with well-distributed temporary values for null swids. This column
// will be dropped after the join. We need to do this so the post-join partitions will be well-distributed,
// and not have a giant partition with all null swids.
String swidWithDistributedNulls = "swid_with_distributed_nulls";
int numNullValues = 10000; // Just use a number that will always be bigger than number of partitions
Column swidWithDistributedNullsCol =
when(csDataset.col(CS_COL_SWID).isNull(), functions.concat(
functions.lit("NULL_SWID_"),
functions.round(functions.rand().multiply(numNullValues)))
)
.otherwise(csDataset.col(CS_COL_SWID));
csDataset = csDataset.withColumn(swidWithDistributedNulls, swidWithDistributedNullsCol);
Then join on this new column, and after the join drop the temporary column:
outputDataset = outputDataset.drop(swidWithDistributedNulls);
Taking reference from https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/, below is code for fighting the skew in Spark using the PySpark DataFrame API.
Creating the 2 dataframes:
from math import exp
from random import randint
from datetime import datetime
def count_elements(splitIndex, iterator):
n = sum(1 for _ in iterator)
yield (splitIndex, n)
def get_part_index(splitIndex, iterator):
for it in iterator:
yield (splitIndex, it)
num_parts = 18
# create the large skewed rdd
skewed_large_rdd = sc.parallelize(range(0,num_parts), num_parts).flatMap(lambda x: range(0, int(exp(x))))
skewed_large_rdd = skewed_large_rdd.mapPartitionsWithIndex(lambda ind, x: get_part_index(ind, x))
skewed_large_df = spark.createDataFrame(skewed_large_rdd,['x','y'])
small_rdd = sc.parallelize(range(0,num_parts), num_parts).map(lambda x: (x, x))
small_df = spark.createDataFrame(small_rdd,['a','b'])
Dividing the data into 100 bins for large df and replicating the small df 100 times
salt_bins = 100
from pyspark.sql import functions as F
skewed_transformed_df = skewed_large_df.withColumn('salt', (F.rand()*salt_bins).cast('int')).cache()
small_transformed_df = small_df.withColumn('replicate', F.array([F.lit(i) for i in range(salt_bins)]))
small_transformed_df = small_transformed_df.select('*', F.explode('replicate').alias('salt')).drop('replicate').cache()
Finally the join avoiding the skew
t0 = datetime.now()
result2 = skewed_transformed_df.join(small_transformed_df, (skewed_transformed_df['x'] == small_transformed_df['a']) & (skewed_transformed_df['salt'] == small_transformed_df['salt']) )
result2.count()
print "The direct join takes %s"%(str(datetime.now() - t0))
Apache DataFu has two methods for doing skewed joins that implement some of the suggestions in the previous answers.
The joinSkewed method does salting (adding a random number column to split the skewed values).
The broadcastJoinSkewed method is for when you can divide the dataframe into skewed and regular parts, as described in Approach 2 from the answer by moriarty007.
These methods in DataFu are useful for projects using Spark 2.x. If you are already on Spark 3, there are dedicated methods for doing skewed joins.
Full disclosure - I am a member of Apache DataFu.
You could try to repartition the "skewed" RDD into more partitions, or try increasing spark.sql.shuffle.partitions (the default is 200).
In your case, I would try to set the number of partitions to be much higher than the number of executors.
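As a small, hedged PySpark sketch of both suggestions (the partition count of 1000 and the names skewed_df, other_df, and id are placeholders for illustration, not recommendations):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Raise the shuffle partition count above the default of 200.
spark.conf.set("spark.sql.shuffle.partitions", "1000")
# Or explicitly repartition the skewed side on the join key before the join.
skewed_df = skewed_df.repartition(1000, "id")  # "id" is an assumed join column
result = skewed_df.join(other_df, on="id")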
I have a table with the following columns/values:
id group_id state type_id
1 g1 NY t1
2 g1 NY t1
3 g1 PA t1
4 g2 NY t1
5 g3 CA t1
6 g4 CA t2
I would like to identify the maximum frequency among group_id/state combinations for a given type_id. To do this in straight SQL I would do something like the following:
SELECT MAX (COUNT (*)) AS perGroup
FROM table_name
WHERE type_id = t1
GROUP BY group_id, state
This seems to work just fine in SQL but when I attempt to translate it to a Criteria Builder in Grails/GORM it always gives me a max of 0.
I've tried just getting the count and it seems to be working properly. However, this would require me to loop in the Groovy code after the DB call, and I would like to avoid that if I can.
This is what I have that works for the count:
Table.createCriteria().list {
eq("type.id",'t1')
projections {
sqlProjection "count(*) as theCount",['theCount'], [INTEGER]
groupProperty "group"
groupProperty "state"
}
}
How would I identify the max instead?
For example, given the data in the above scenario I would expect the following output:
t1 = 2
t2 = 1
I have two datasets, DS1 and DS2. DS1 is 100,000 rows x 40 cols, DS2 is 20,000 rows x 20 cols. I need to pull COL1 from DS1 if some fields match DS2.
Since I am very new to SAS, I am trying to stick to SQL logic.
So basically I did (short version):
proc sql;
...
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
on DS1.COL2=DS2.COL3
OR DS1.COL3=DS2.COL3
OR DS1.COL4=DS2.COL2
...
After an hour or so it was still running, and I was getting emails from SAS saying I was using 700 GB or so. Is there a better and faster SAS way of doing this operation?
I would use 3 separate queries and use a UNION
proc sql;
...
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
on DS1.COL2=DS2.COL3
UNION
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
On DS1.COL3=DS2.COL3
UNION
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
ON DS1.COL4=DS2.COL2
...
You may have null or blank values in the columns you are joining on. Your query is probably matching all the null/blank columns together resulting in a very large result set.
I suggest adding additional clauses to exclude null results.
Also - if the same row happens to exist in both tables, then you should also prevent the row from joining to itself.
Either of these could effectively result in a cartesian product join (or something close to a cartesian product join).
EDIT: By the way, a good way of debugging this type of problem is to limit both datasets to a certain number of rows (say 100 in each), then run it and check the output to make sure it's what you expect. You can do this using the SQL options inobs=, outobs=, and loops=. Here's a link to the documentation.
First sort the datasets that you are trying to merge using PROC SORT, then merge them by ID.
Here is how you can do it. I have assumed your match field is ID.
proc sort data=DS1;
by ID;
run;
proc sort data=DS2;
by ID;
run;
data out;
merge DS1 DS2;
by ID;
run;
You can use PROC SORT for DS3 and DS4 and then include them in the MERGE statement if you need to join them as well.