I have two RDDs that I need to join together. They look like the following:
RDD1
[(u'2', u'100', 2),
(u'1', u'300', 1),
(u'1', u'200', 1)]
RDD2
[(u'1', u'2'), (u'1', u'3')]
My desired output is:
[(u'1', u'2', u'100', 2)]
So I would like to select the entries of RDD2 whose second value matches the first value of RDD1, and combine them. I have tried join and also cartesian, and neither is working; I am not getting even close to what I am looking for. I am new to Spark and would appreciate any help from you guys.
Thanks
Dataframe: if using Spark DataFrames is allowed in the solution, you can turn the given RDDs into DataFrames and join them on the corresponding column.
df1 = spark.createDataFrame(rdd1, schema=['a', 'b', 'c'])
df2 = spark.createDataFrame(rdd2, schema=['d', 'a'])
rdd_join = df1.join(df2, on='a')
out = rdd_join.rdd.collect()
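If you need exactly the tuple layout from the question, you can reorder the columns before collecting instead of collecting rdd_join directly. A minimal sketch, assuming the column names chosen above ('a', 'b', 'c', 'd'); on Python 3 the u prefixes simply disappear:
out = (
    rdd_join
    .select('d', 'a', 'b', 'c')   # RDD2 key first, then the join key and the RDD1 fields
    .rdd
    .map(tuple)                   # Row -> plain tuple
    .collect()
)
print(out)  # expected: [('1', '2', '100', 2)]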
RDD: just move the key that you want to join on into the first position of each tuple, then simply use join to do the joining.
rdd1_zip = rdd1.map(lambda x: (x[0], (x[1], x[2])))
rdd2_zip = rdd2.map(lambda x: (x[1], x[0]))
rdd_join = rdd1_zip.join(rdd2_zip)
rdd_out = rdd_join.map(lambda x: (x[1][1], x[0], x[1][0][0], x[1][0][1])).collect() # flatten and reorder to match the desired output: [(u'1', u'2', u'100', 2)]
print(rdd_out)
Your process looks manual to me. Here is sample code:
rdd = sc.parallelize([(u'2', u'100', 2),(u'1', u'300', 1),(u'1', u'200', 1)])
rdd1 = sc.parallelize([(u'1', u'2'), (u'1', u'3')])
newRdd = rdd1.map(lambda x:(x[1],x[0])).join(rdd.map(lambda x:(x[0],(x[1],x[2]))))
newRdd.map(lambda x:(x[1][0], x[0], x[1][1][0], x[1][1][1])).coalesce(1).collect()
OUTPUT:
[(u'1', u'2', u'100', 2)]
I have a question about comparing tables.
I want to compare data from the same table with different filter conditions:
First Version:
select *
from PPT_TIER4_FTSRB.AUTO_SOURCE_ACCOUNT
WHERE BUSINESS_DATE = DATE '2022-05-31'
AND GRAIN ='ACCOUNT'
AND LAYER = 'IDL'
AND SOURCE_CD = 'MTMB'
Second Version:
select *
from PPT_TIER4_FTSRB.AUTO_SOURCE_ACCOUNT
WHERE BUSINESS_DATE = DATE '2022-05-31'
AND GRAIN ='ACCOUNT'
AND LAYER = 'ACQ'
AND SOURCE_CD = 'MTMB'
As you can see, the only difference between the two is LAYER = 'IDL' in the first version and LAYER = 'ACQ' in the second.
I wanted to see which records match between the two, excluding the column LAYER (because it will always be different).
I tried to do an inner join, but it keeps running for a very long time:
SELECT *
FROM
( select *
from PPT_TIER4_FTSRB.AUTO_SOURCE_ACCOUNT
WHERE BUSINESS_DATE = DATE '2022-05-31'
AND GRAIN ='ACCOUNT'
AND LAYER = 'IDL'
AND SOURCE_CD = 'MTMB'
) A
INNER JOIN
( select *
from PPT_TIER4_FTSRB.AUTO_SOURCE_ACCOUNT
WHERE BUSINESS_DATE = DATE '2022-05-31'
AND GRAIN ='ACCOUNT'
AND LAYER = 'ACQ'
AND SOURCE_CD = 'MTMB'
) B
ON A.BUSINESS_DATE = B.BUSINESS_DATE
AND A.GRAIN =B.GRAIN
AND A.SOURCE_CD = B.SOURCE_CD
This is because a join for your purposes would need a 1:1 relationship between the rows being joined. You don't appear to have that, and haven't given any example data for us to derive one.
For example:
sample 1 has rows 1, 2, 3
sample 2 has rows a, b, c
your results give 1a,1b,1c,2a,2b,2c,3a,3b,3c
That's effectively a CROSS JOIN, which happens because the columns you're joining on are always the same on every row.
My advice would be to select all the rows in question and sort them, then visually check whether there are any patterns you want to analyse with joins or aggregates...
SELECT *
FROM ppt_tier4_ftsrb.auto_source_account
WHERE business_date = DATE '2022-05-31'
AND grain ='ACCOUNT'
AND layer IN ('ACQ', 'IDL')
AND source_cd = 'MTMB'
ORDER BY layer, and, some, other, columns
I have two data frames with columns of interest: 'ParseCom', which is the left index of this fuzzy join, and 'REF', which should be a substring of 'ParseCom' during the join.
The code below iterates over the DataFrames, which is not recommended.
How can I implement a fuzzy join in Dask where I am joining on substrings?
for i, com in enumerate(defects['ParseCom']):
for j, sub in enumerate(repair_matrix['REF']):
if sub in com:
print(i,j, com)
Modifying the approach shown in Merge pandas on string contains:
Using pandas:
import pandas as pd

def inmerge(sub, sup, sub_on, sup_on, sub_index, sup_index):
    # Rename the match column to 'on' and the key column to 'common' in both frames.
    sub_par = sub.rename(columns={sub_on: 'on', sub_index: 'common'})
    sup_par = sup.rename(columns={sup_on: 'on', sup_index: 'common'})
    print(sub_par.columns.tolist())
    print(sup_par.columns.tolist())
    # For each value in sub_par.on, keep the 'common' key of the first row of sup_par
    # whose 'on' string contains it as a substring.
    rhs = (sub_par.on
           .apply(lambda x: sup_par[sup_par.on.str.find(x).ge(0)].common)
           .bfill(axis=1)
           .iloc[:, 0])
    # Relation table mapping sub_index values to the matching sup_index values.
    rel = (pd.concat([sub_par.common, rhs], axis=1, ignore_index=True)
           .rename(columns={0: sub_index, 1: sup_index}))[[sub_index, sup_index]]
    print(rel.columns.tolist())
    print(sub_index, sup_index)
    # Attach the relation to sub, then merge sup on the shared key.
    sub = sub.merge(rel, on=sub_index)
    return sub.merge(sup, on=sup_index)
This has limitations, such as requiring pandas and still not being especially fast, but it does run faster than the nested for loop.
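Since the question asks about Dask specifically, one way to reuse the same substring idea is to keep the small 'REF' table in memory, apply the matching per partition with map_partitions, and finish with an ordinary equi-join. This is only a sketch, assuming defects and repair_matrix are the pandas DataFrames from the question, that repair_matrix is small enough to send to every partition, and that the column names are 'ParseCom' and 'REF'; the helper names here are illustrative, not part of any library:
import dask.dataframe as dd

def attach_ref(part, refs):
    # For each ParseCom value in this partition, keep the first REF that is a substring of it.
    def first_match(com):
        hits = [r for r in refs if r in com]
        return hits[0] if hits else None
    part = part.copy()
    part['REF'] = part['ParseCom'].apply(first_match)
    return part

refs = repair_matrix['REF'].dropna().tolist()        # small lookup list sent to every partition
ddf = dd.from_pandas(defects, npartitions=4)         # defects as a Dask DataFrame
matched = ddf.map_partitions(attach_ref, refs)
result = matched.merge(repair_matrix, on='REF', how='left')  # regular equi-join once REF is attached
print(result.compute())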
I have two dataframes, left and right. The latter, right, is a subset of left, such that left contains all the rows right does. I want to use right to remove redundant rows from left by doing a simple "left_anti" join.
I've discovered that the join doesn't work if I use a filtered version of left on the right. It works only if I reconstruct the right dataframe from scratch.
What is going on here?
Is there a workaround that doesn't involve recreating the right dataframe?
from pyspark.sql import Row, SparkSession
import pyspark.sql.types as t
schema = t.StructType(
    [
        t.StructField("street_number", t.IntegerType()),
        t.StructField("street_name", t.StringType()),
        t.StructField("lower_street_number", t.IntegerType()),
        t.StructField("upper_street_number", t.IntegerType()),
    ]
)
data = [
    # Row that conflicts w/ range row, and should be removed
    Row(
        street_number=123,
        street_name="Main St",
        lower_street_number=None,
        upper_street_number=None,
    ),
    # Range row
    Row(
        street_number=None,
        street_name="Main St",
        lower_street_number=120,
        upper_street_number=130,
    ),
]
def join_files(left_side, right_side):
    join_condition = [
        (
            (right_side.lower_street_number.isNotNull())
            & (right_side.upper_street_number.isNotNull())
            & (right_side.lower_street_number <= left_side.street_number)
            & (right_side.upper_street_number >= left_side.street_number)
        )
    ]
    return left_side.join(right_side, join_condition, "left_anti")
spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame(data, schema)
right_fail = left.filter("lower_street_number IS NOT NULL")
result = join_files(left, right_fail)
result.count() # Returns 2 - both rows still present
right_success = spark.createDataFrame([data[1]], schema)
result = join_files(left, right_success)
result.count() # Returns 1 - the "left_anti" join worked as expected
You could alias the DFs:
import pyspark.sql.functions as F
def join_files(left_side, right_side):
    join_condition = [
        (
            (F.col("right_side.lower_street_number").isNotNull())
            & (F.col("right_side.upper_street_number").isNotNull())
            & (F.col("right_side.lower_street_number") <= F.col("left_side.street_number"))
            & (F.col("right_side.upper_street_number") >= F.col("left_side.street_number"))
        )
    ]
    return left_side.join(right_side, join_condition, "left_anti")
spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame(data, schema).alias("left_side")
right_fail = left.filter("lower_street_number IS NOT NULL").alias("right_side")
result = join_files(left, right_fail)
print(result.count()) # With the aliases and qualified column names, this should now return 1 - the conflicting row is removed
right_success = spark.createDataFrame([data[1]], schema).alias("right_side")
result = join_files(left, right_success)
result.count() # Returns 1 - the "left_anti" join worked as expected
I don't know which PySpark version you are on, but with pyspark==3.0.1 I get the following explanatory error:
AnalysisException: Column lower_street_number#522, upper_street_number#523, lower_street_number#522, upper_street_number#523 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.;
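As the error text itself suggests, the strict ambiguity check can also be turned off via configuration, though aliasing is the cleaner fix. A minimal sketch, assuming a Spark 3.x session named spark:
# Disable the ambiguous-self-join check named in the error message
# (use with care; qualified aliases as shown above are usually preferable).
spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")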
I've used QUERY many times to filter rows, e.g. select * WHERE R = "ColContas", which keeps only the rows where column R has the "ColContas" tag.
This time I need to filter columns: I want to keep only the columns whose value in row 5 is "ColContas" or "ColNum".
It should be something like select * WHERE "row5" = "ColContas" OR "row5" = "ColNum".
How could I do that?
try:
=TRANSPOSE(QUERY(TRANSPOSE(A5:G); "where Col1 matches 'ColContas|ColNum'"; 1))
I am new to Pig and I have two datasets, "highspender" and "feedback".
Highspender:
Price,fname,lname
$50,Jack,Brown
$30,Rovin,Pall
Feedback:
date,Name,rate
2015-01-02,Jack B Brown,5
2015-01-02,Pall,4
Now I have to join these two datasets on the name. My condition is that the fname or lname of Highspender should match the Name of Feedback. How do I join these two datasets? Any ideas?
You can try the script below to do the same; all you need to do is replace the names according to your data:
highs = LOAD 'highs' using PigStorage(',') as (Price:chararray,fname:chararray,lname:chararray);
feedback = LOAD 'feeds' using PigStorage(',') as (date:chararray,Name:chararray,rate:chararray);
out = JOIN highs BY fname, feedback BY Name;
out1 = JOIN highs BY lname, feedback BY Name;
final_out = UNION out,out1;
For further help you can refer to the Pig Reference Manual.
EDIT
As per the comment, the script for joining the data with a string function is as below:
highs = LOAD 'highs' using PigStorage(',') as (Price:chararray,fname:chararray,lname:chararray);
feedback = LOAD 'feeds' using PigStorage(',') as (date:chararray,Name:chararray,rate:chararray);
crossout = CROSS highs, feedback;
final_lname = FILTER crossout BY (REPLACE(feedback::Name, highs::lname, '') != feedback::Name);
final_fname = FILTER crossout BY (REPLACE(feedback::Name, highs::fname, '') != feedback::Name);
final = UNION final_lname, final_fname;