Spark: Join dataframe column with an array - join

I have two DataFrames, each with two columns:
df1 with schema (key1: Long, Value)
df2 with schema (key2: Array[Long], Value)
I need to join these DataFrames on the key columns (find matching values between key1 and the values in key2). The problem is that they don't have the same type. Is there a way to do this?

The best way to do this (and the one that doesn't require any casting or exploding of the DataFrames) is to use the array_contains Spark SQL expression, as shown below.
import org.apache.spark.sql.functions.expr
import spark.implicits._
val df1 = Seq((1L,"one.df1"), (2L,"two.df1"),(3L,"three.df1")).toDF("key1","Value")
val df2 = Seq((Array(1L,1L),"one.df2"), (Array(2L,2L),"two.df2"), (Array(3L,3L),"three.df2")).toDF("key2","Value")
val joined = df1.join(df2, expr("array_contains(key2, key1)"))
joined.show()
+----+---------+------+---------+
|key1| Value| key2| Value|
+----+---------+------+---------+
| 1| one.df1|[1, 1]| one.df2|
| 2| two.df1|[2, 2]| two.df2|
| 3|three.df1|[3, 3]|three.df2|
+----+---------+------+---------+
Please note that you cannot use the org.apache.spark.sql.functions.array_contains function directly as it requires the second argument to be a literal as opposed to a column expression.
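If you are working in PySpark rather than Scala, the same join condition can be expressed through expr. This is a minimal sketch assuming the same df1/df2 shapes as above and an existing SparkSession named spark:
from pyspark.sql import functions as F

df1 = spark.createDataFrame([(1, "one.df1"), (2, "two.df1"), (3, "three.df1")], ["key1", "Value"])
df2 = spark.createDataFrame([([1, 1], "one.df2"), ([2, 2], "two.df2"), ([3, 3], "three.df2")], ["key2", "Value"])

# Join on "key1 appears somewhere in the key2 array", mirroring the Scala answer above.
joined = df1.join(df2, F.expr("array_contains(key2, key1)"))
joined.show()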

You can cast key1 and key2 to strings and then use the contains function, as follows.
val df1 = sc.parallelize(Seq((1L,"one.df1"),
(2L,"two.df1"),
(3L,"three.df1"))).toDF("key1","Value")
DF1:
+----+---------+
|key1|Value |
+----+---------+
|1 |one.df1 |
|2 |two.df1 |
|3 |three.df1|
+----+---------+
val df2 = sc.parallelize(Seq((Array(1L,1L),"one.df2"),
(Array(2L,2L),"two.df2"),
(Array(3L,3L),"three.df2"))).toDF("key2","Value")
DF2:
+------+---------+
|key2 |Value |
+------+---------+
|[1, 1]|one.df2 |
|[2, 2]|two.df2 |
|[3, 3]|three.df2|
+------+---------+
import org.apache.spark.sql.functions.col
val joinedDF = df1.join(df2, col("key2").cast("string").contains(col("key1").cast("string")))
JOIN:
+----+---------+------+---------+
|key1|Value |key2 |Value |
+----+---------+------+---------+
|1 |one.df1 |[1, 1]|one.df2 |
|2 |two.df1 |[2, 2]|two.df2 |
|3 |three.df1|[3, 3]|three.df2|
+----+---------+------+---------+
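For completeness, here is a PySpark sketch of the same cast-and-contains condition, assuming the df1/df2 built in the PySpark snippet above. Keep in mind that string containment can over-match (for example, "1" is contained in the string form of [11, 12]), so the array_contains join shown earlier is generally the safer option.
from pyspark.sql import functions as F

# Cast the array column and the key to strings and join on substring containment,
# mirroring the Scala snippet above.
joined = df1.join(
    df2,
    F.col("key2").cast("string").contains(F.col("key1").cast("string"))
)
joined.show()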

Related

How can take n samples/events from all predicted clusters and return in form of dataframe in PySpark?

I'm following this tutorial and trying to pick/select the top n events, let's say 10 events/rows, after assigning the predicted clusters to the main df, then merge them and report the result in the form of a Spark dataframe.
Let's say I have a main dataframe df1 containing the 3 features below:
+-----+------------+----------+----------+
| id| x| y| z|
+-----+------------+----------+----------+
| row0| -6.0776997|-2.9096103|-1.5181729|
| row1| -1.0122601| 7.322841|-5.4424076|
| row2| -8.297007| 6.3228936| 1.1672047|
| row3| -3.5071216| 4.784812|-5.4449472|
| row4| -5.122823|-3.3220499|-0.5069805|
| row5| -2.4764006| 8.255791| 4.409478|
| row6| 7.3153954| -5.079449| -7.291215|
| row7| -2.0167463| 9.303454| 7.095179|
| row8| -0.2338185| -4.892681| 2.1228876|
| row9| 6.565442| -6.855994|-6.7983212|
|row10| -5.6902847|-6.4827404|-0.9246967|
|row11|-0.017986143| 2.7632365| -8.814824|
|row12| -6.9042625|-6.1491723|-3.5354295|
|row13| -10.389865| 9.537853| 0.674591|
|row14| 3.9688683|-6.0467844| -5.462389|
|row15| -7.337052|-3.7689247| -5.261122|
|row16| -8.991589| 8.738728| 3.864116|
|row17| -0.18098584| 5.482743| -4.900118|
|row18| 3.3193955|-6.3573766| -6.978025|
|row19| -2.0266335|-3.4171724|0.48218703|
+-----+------------+----------+----------+
Now I have information out of the clustering algorithm in the form of the dataframe df2, as below:
print("==========================Short report==================================== ")
n_clusters = model.summary.k
#n_clusters
print("Number of predicted clusters: " + str(n_clusters))
cluster_Sizes = model.summary.clusterSizes
#cluster_Sizes
col = ['size']
df2 = pd.DataFrame(cluster_Sizes, columns=col).sort_values(by=['size'], ascending=True) #sorting
cluster_Sizes = df2["size"].unique()
print("Size of predicted clusters: " + str(cluster_Sizes))
df2
#==========================Short report====================================
#Number of predicted clusters: 10
#Size of predicted clusters: [ 486 496 504 529 985 998 999 1003 2000]
+-----+----------+
|     |      size|
+-----+----------+
| 2| 486|
| 6| 496|
| 0| 504|
| 8| 529|
| 5| 985|
| 9| 998|
| 7| 999|
| 3| 1003|
| 1| 2000|
| 4| 2000|
+-----+----------+
So here, the index column holds the predicted cluster labels. I could assign the predicted cluster labels to the main dataframe, but not the cluster sizes, as below:
+-----+----------+------------+----------+----------+
| id|prediction| x| y| z|
+-----+----------+------------+----------+----------+
| row0| 9| -6.0776997|-2.9096103|-1.5181729|
| row1| 4| -1.0122601| 7.322841|-5.4424076|
| row2| 1| -8.297007| 6.3228936| 1.1672047|
| row3| 4| -3.5071216| 4.784812|-5.4449472|
| row4| 3| -5.122823|-3.3220499|-0.5069805|
| row5| 1| -2.4764006| 8.255791| 4.409478|
| row6| 5| 7.3153954| -5.079449| -7.291215|
| row7| 1| -2.0167463| 9.303454| 7.095179|
| row8| 7| -0.2338185| -4.892681| 2.1228876|
| row9| 5| 6.565442| -6.855994|-6.7983212|
|row10| 3| -5.6902847|-6.4827404|-0.9246967|
|row11| 4|-0.017986143| 2.7632365| -8.814824|
|row12| 9| -6.9042625|-6.1491723|-3.5354295|
|row13| 1| -10.389865| 9.537853| 0.674591|
|row14| 2| 3.9688683|-6.0467844| -5.462389|
|row15| 9| -7.337052|-3.7689247| -5.261122|
|row16| 1| -8.991589| 8.738728| 3.864116|
|row17| 4| -0.18098584| 5.482743| -4.900118|
|row18| 2| 3.3193955|-6.3573766| -6.978025|
|row19| 7| -2.0266335|-3.4171724|0.48218703|
+-----+----------+------------+----------+----------+
Now I want to include/report the top n rows of each cluster in the form of a dataframe. What I have tried so far is (multi-)conditional filtering:
print("==========================Short report==================================== ")
n_clusters = model.summary.k
#n_clusters
print("Number of predicted clusters: " + str(n_clusters))
cluster_Sizes = model.summary.clusterSizes
#cluster_Sizes
col = ['size']
clusterSizes = pd.DataFrame(cluster_Sizes, columns=col).sort_values(by=['size'], ascending=True) #sorting
cluster_Sizes = clusterSizes["size"].unique()
print("Size of predicted clusters: " + str(cluster_Sizes))
clusterSizes
from pyspark.sql.functions import max, min

def cls_report(df):
    x1 = df.select([min("x")])  # min of x
    x2 = df.select([max("y")])  # max of y
    x3 = df.select([max("z")])  # max of z
    return x1, x2, x3
#pick top out clusters with minimum instances
km_1st_cls = clusterSizes.values[0]
km_2nd_cls = clusterSizes.values[1]
km_3rd_cls = clusterSizes.values[2]
print(km_1st_cls)
print(km_2nd_cls)
print(km_3rd_cls)
#F1 = cls_report(pddf_pred.filter(f"prediction == {km_1st_cls}"))[0]
F1 = cls_report(pddf_pred.filter(f"prediction == {km_1st_cls}"))[0].select("min(x)").rdd.flatMap(list).collect()[0]
F2 = cls_report(pddf_pred.filter(f"prediction == {km_2nd_cls}"))[0].select("min(x)").rdd.flatMap(list).collect()[0]
F3 = cls_report(pddf_pred.filter(f"prediction == {km_3rd_cls}"))[0].select("min(x)").rdd.flatMap(list).collect()[0]
L1 = cls_report(pddf_pred.filter(f"prediction == {km_1st_cls}"))[1].select("max(y)").rdd.flatMap(list).collect()[0]
L2 = cls_report(pddf_pred.filter(f"prediction == {km_2nd_cls}"))[1].select("max(y)").rdd.flatMap(list).collect()[0]
L3 = cls_report(pddf_pred.filter(f"prediction == {km_3rd_cls}"))[1].select("max(y)").rdd.flatMap(list).collect()[0]
T1 = cls_report(pddf_pred.filter(f"prediction == {km_1st_cls}"))[2].select("max(z)").rdd.flatMap(list).collect()[0]
T2 = cls_report(pddf_pred.filter(f"prediction == {km_2nd_cls}"))[2].select("max(z)").rdd.flatMap(list).collect()[0]
T3 = cls_report(pddf_pred.filter(f"prediction == {km_3rd_cls}"))[2].select("max(z)").rdd.flatMap(list).collect()[0]
print(F1)
print(F2)
print(F3)
print(L1)
print(L2)
print(L3)
print(T1)
print(T2)
print(T3)
df_anomaly_1st_cls = pddf_pred.filter(f"(prediction == {km_1st_cls})") \
    .filter(f"y == {L1}") \
    .filter(f"z == {T1}") \
    .filter(f"x == {F1}")
display(df_anomaly_1st_cls)
I know that with the KMeans algorithm in scikit-learn we could do the following, based on this post:
clusters=KMeans(n_clusters=5)
df[clusters.labels_==0]
But we don't have access to such a labels_ attribute in Spark for a quick hack like this.
Is there any elegant way (maybe by defining a function) to do this, so that we can reflect the results of any clustering algorithm on the main dataframe for better reasoning?
Note: I'm not interested in hacking it by converting the Spark dataframe into a Pandas dataframe using .toPandas().
Update: I might need a function that automates the filtering based on multiple conditions. It takes the input dataframe, the number of top clusters (ranked by their number of instances), and the number of events/rows per cluster, and returns the filtered/selected results stacked together. A mock of the expected dataframe follows:
def filtering(df, top_out_cluster=2, top_events=2):  # top_events = rows per cluster
# +-----+----------+------------+----------+----------+
# | id|prediction| x| y| z|
# +-----+----------+------------+----------+----------+
#1st top out cluster | row4| 3| -5.122823|-3.3220499|-0.5069805|
#conditions F1, L1, T1 |row10| 3| -5.6902847|-6.4827404|-0.9246967|
#2nd top out cluster | row8| 7| -0.2338185| -4.892681| 2.1228876|
#conditions F1, L1, T1 |row19| 7| -2.0266335|-3.4171724|0.48218703|
#3rd top out cluster |row18| 2| 3.3193955|-6.3573766| -6.978025|
#conditions F1, L1, T1 |row14| 2| 3.9688683|-6.0467844| -5.462389|
#1st top out cluster | row6| 5| 7.3153954| -5.079449| -7.291215|
#conditions F2, L2, T2 | row9| 5| 6.565442| -6.855994|-6.7983212|
#2nd top out cluster |row12| 9| -6.9042625|-6.1491723|-3.5354295|
#conditions F2, L2, T2 | row0| 9| -6.0776997|-2.9096103|-1.5181729|
#1st top out cluster | row1| 4| -1.0122601| 7.322841|-5.4424076|
#conditions F3, L3, T3 |row11| 4|-0.017986143| 2.7632365| -8.814824|
#2nd top out cluster |row13| 1| -10.389865| 9.537853| 0.674591|
#conditions F3, L3, T3 | row5| 1| -2.4764006| 8.255791| 4.409478|
# +-----+----------+------------+----------+----------+
As per your requirements discussed over Colab, the following is the code.
import pandas as pd
from pyspark.sql.window import Window
from pyspark.sql import functions as f

def get_top_n_clusters(model, top_out_cluster: int):
    n_clusters = model.summary.k
    cluster_size = model.summary.clusterSizes
    if top_out_cluster > n_clusters:
        raise ValueError("top_out_cluster cannot be greater than the number of clusters")
    col = ['size']
    cluster_size = pd.DataFrame(cluster_size, columns=col).sort_values(by=['size'], ascending=True)  # sorting
    return list(cluster_size.head(top_out_cluster).index)

def filtering(df, labels: list, top_records: int):
    winspec = Window.partitionBy("prediction").orderBy("prediction")
    return (
        df.filter(f.col("prediction").isin(labels))
          .withColumn("rowNum", f.row_number().over(winspec))
          .withColumn("minX", f.min(f.col("x")).over(winspec))
          .withColumn("maxY", f.max(f.col("y")).over(winspec))
          .withColumn("maxZ", f.max(f.col("z")).over(winspec))
          .filter(f.col("rowNum") <= top_records)
          .selectExpr("id", "prediction", "minX as x", "maxY as y", "maxZ as z")
    )
cluster_labels = get_top_n_clusters(model, top_out_cluster=4)
fdf = filtering(df_pred, labels=cluster_labels, top_records=1)
fdf.show()
+-----+----------+----------+----------+-----------+
| id|prediction| x| y| z|
+-----+----------+----------+----------+-----------+
|row15| 7|-10.505684|-1.6820424|-0.32242402|
| row4| 9| -9.426199| 0.5639291| 3.5664654|
| row0| 2| -8.317323|-1.3278837|-0.33906546|
|row14| 4|-0.9338185|-3.6411285| -3.7280529|
+-----+----------+----------+----------+-----------+

Pyspark : Join 2 dataframe to get only new records from 2nd dataframe (Historisation)

I have 2 dataframes, df1 and df2. I want the result to be like this:
1. Take all records of df1.
2. Take only the new records from df2 (records whose id is not available in df1).
3. Generate a new dataframe with this logic.
Note: The primary key is "id". I want to check only on id, not the complete row. A row should be taken from df2 only if its id is not available in df1.
df1
+------+-------------+-----+
| id |time |other|
+------+-------------+-----+
| 111| 29-12-2019 | p1|
| 222| 29-12-2019 | p2|
| 333| 29-12-2019 | p3|
+------+-------------+-----+
df2
+------+-------------+-----+
| id |time |other|
+------+-------------+-----+
| 111| 30-12-2019 | p7|
| 222| 30-12-2019 | p8|
| 444| 30-12-2019 | p0|
+------+-------------+-----+
Result
+------+-------------+-----+
| id |time |other|
+------+-------------+-----+
| 111| 29-12-2019 | p1|
| 222| 29-12-2019 | p2|
| 333| 29-12-2019 | p3|
| 444| 30-12-2019 | p0|
+------+-------------+-----+
Could you please help me do this in PySpark? I am planning to use a join.
df1=spark.createDataFrame([(111,'29-12-2019','p1'),(222,'29-12-2019','p2'),(333,'29-12-2019','p3')],['id','time','other'])
df2=spark.createDataFrame([(111,'30-12-2019','p7'),(222,'30-12-2019','p8'),(444,'30-12-2019','p0')],['id','time','other'])
mvv1 = df1.select("id").rdd.flatMap(lambda x: x).collect()
print(mvv1)
[111, 222, 333]
yy=",".join([str(x) for x in mvv1])
df2.registerTempTable("temp_df2")
sqlDF2 = sqlContext.sql("select * from temp_df2 where id not in ("+yy+")")
sqlDF2.show()
+---+----------+-----+
| id| time|other|
+---+----------+-----+
|444|30-12-2019| p0|
+---+----------+-----+
df1.union(sqlDF2).show()
+---+----------+-----+
| id| time|other|
+---+----------+-----+
|111|29-12-2019| p1|
|222|29-12-2019| p2|
|333|29-12-2019| p3|
|444|30-12-2019| p0|
+---+----------+-----+
Finally I wrote this code and it seems to work fine; for 12,000,000 rows it takes only 5 minutes to build. I hope it helps others:
df1=spark.createDataFrame([(111,'29-12-2019','p1'),(222,'29-12-2019','p2'),(333,'29-12-2019','p3')],['id','time','other'])
df2=spark.createDataFrame([(111,'30-12-2019','p7'),(222,'30-12-2019','p8'),(444,'30-12-2019','p0')],['id','time','other'])
# This gives all records from df2 which are not available in df1 (left anti join on id)
new_input_df = df2.join(df1, on=['id'], how='left_anti')
# Now union df1 (historic records) and new_input_df, which contains only the new records
final_df = df1.union(new_input_df)
final_df.show()

How to get the index of FOREACH iterations

Within a FOREACH statement [e.g. day in range(dayX, dayY)], is there an easy way to find out the index of the iteration?
Yes, you can.
Here is an example query that creates 8 Day nodes that contain an index and day:
WITH 5 AS day1, 12 AS day2
FOREACH (i IN RANGE(0, day2-day1) |
CREATE (:Day { index: i, day: day1+i }));
This query prints out the resulting nodes:
MATCH (d:Day)
RETURN d
ORDER BY d.index;
and here is an example result:
+--------------------------+
| d |
+--------------------------+
| Node[54]{day:5,index:0} |
| Node[55]{day:6,index:1} |
| Node[56]{day:7,index:2} |
| Node[57]{day:8,index:3} |
| Node[58]{day:9,index:4} |
| Node[59]{day:10,index:5} |
| Node[60]{day:11,index:6} |
| Node[61]{day:12,index:7} |
+--------------------------+
FOREACH does not yield the index during iteration. If you want the index you can use a combination of range and UNWIND like this:
WITH ["some", "array", "of", "things"] AS things
UNWIND range(0,size(things)-2) AS i
// Do something for each element in the array. In this case connect two Things
MERGE (t1:Thing {name:things[i]})-[:RELATED_TO]->(t2:Thing {name:things[i+1]})
This example iterates a counter i that you can use to access the items at index i and i+1 in the array.

A complex match

I got the following cypher query:
neo4j-sh$ start n=node(1344) match (n)-[t:_HAS_TRANSLATION]-(p) return t,p;
+-----------------------------------------------------------------------------------+
| t | p |
+-----------------------------------------------------------------------------------+
| :_HAS_TRANSLATION[2224]{of:"value"} | Node[1349]{language:"hi-hi",text:"(>0#"} |
| :_HAS_TRANSLATION[2223]{of:"value"} | Node[1348]{language:"es-es",text:"hembra"} |
| :_HAS_TRANSLATION[2222]{of:"value"} | Node[1347]{language:"ru-ru",text:"65=A:89"} |
| :_HAS_TRANSLATION[2221]{of:"value"} | Node[1346]{language:"en-us",text:"female"} |
| :_HAS_TRANSLATION[2220]{of:"value"} | Node[1345]{language:"it-it",text:"femmina"} |
+-----------------------------------------------------------------------------------+
and the following array:
["it-it", "en-us", "fr-fr", "de-de", "ru-ru", "hi-hi"]
How can I change the query to return just one result, where 'language' matches the first entry in the array that has a corresponding node?
If the array were
["fr-fr","jp-jp","en-us", "it-it", "de-de", "ru-ru", "hi-hi"]
I'd need to return Node[1346], because it is the first with a match in the language array (en-us), there being no entries for fr-fr and jp-jp.
Thank you
Paolo
Cypher can express arrays and index into them. So on one level, you could do this:
start n=node(1344)
match (n)-[t:_HAS_TRANSLATION]-(p)
where p.language = ["it-it", "en-us", "fr-fr", "de-de", "ru-ru", "hi-hi"][0] return t,p;
This is really just the same as asking for those nodes p where p.language="it-it" (the first element in your array).
Now, if what you mean is that the language attribute itself can be an array, then just treat it like one:
$ create (p { language: ['it-it', 'en-us']});
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1
Properties set: 1
$ match (p) where p.language[0] = 'it-it' return p;
+-------------------------------------+
| p |
+-------------------------------------+
| Node[1]{language:["it-it","en-us"]} |
+-------------------------------------+
1 row
38 ms
Note the array brackets on p.language[0].
Finally, if what you're talking about is splitting your language values into pieces (i.e. "en-us" = ["en", "us"]), then Cypher's string processing functions are a little on the weak side, and I wouldn't try to do this. Instead, I'd pre-process that before inserting it into the graph in the first place, and query the pieces separately.
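For example, here is a minimal Python sketch of that pre-processing, splitting the locale before insertion so the two pieces are stored as separate properties. The connection URI, credentials, node label, and property names are illustrative assumptions, not taken from the question.
from neo4j import GraphDatabase

# Hypothetical connection details for illustration only.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def insert_translation(session, locale, text):
    # "en-us" -> language "en", region "us", stored as separate properties
    language, region = locale.split("-")
    session.run(
        "CREATE (:Translation {language: $language, region: $region, text: $text})",
        language=language, region=region, text=text,
    )

with driver.session() as session:
    insert_translation(session, "en-us", "female")
driver.close()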

Grouping in MDX Query

I am very new to the MDX world.
I want to group the columns based on only 3 rows, but I need the join for the 4th row as well.
My query is :
SELECT
    { [Measures].[Live Item Count] } DIMENSION PROPERTIES parent_unique_name ON COLUMNS,
    Crossjoin(
        Crossjoin(
            [Item].[Class].&[Light],
            [Item].[Style].&[Fav],
            [Item].[Season Year].members),
        [Item].[Count].children) ON ROWS
FROM Cube
Output comes as :
Light(Row) | FAV(Row) | ALL(Row) | 16(Row) | 2(col)
Light(Row) | FAV(Row) | ALL(Row) | 7(Row) | 1(col)
Light(Row) | FAV(Row) | 2012(Row) | 16(Row)| 2(col)
Light(Row) | FAV(Row) | 2011(Row) | 7(Row) | 1(col)
But, I want my output to be displayed as:
Light(Row) | FAV(Row) | ALL(Row) | | 3(col)
Light(Row) | FAV(Row) | 2012(Row) | 16(Row)| 2(col)
Light(Row) | FAV(Row) | 2011(Row) | 7(Row) | 1(col)
i.e., I want to group my first two rows so that there is no duplicate 'ALL' in the 3rd column.
Thanks in advance
Try this - using the level name Season Year with the attribute name Season Year will pick up every member without the ALL member:
SELECT
    { [Measures].[Live Item Count] } DIMENSION PROPERTIES parent_unique_name ON COLUMNS,
    Crossjoin(
        Crossjoin(
            [Item].[Class].&[Light],
            [Item].[Style].&[Fav],
            [Item].[Season Year].[Season Year].members),
        [Item].[Count].children) ON ROWS
FROM Cube
You can use this query if there is an All member on the [Item].[Count] hierarchy:
SELECT {[Measures].[Live Item Count]} DIMENSION PROPERTIES parent_unique_name ON COLUMNS,
    Crossjoin(
        Crossjoin([Item].[Class].&[Light], [Item].[Style].&[Fav]),
        Union(
            Crossjoin({"All member of [Item].[Season Year]"}, {"All member of [Item].[Count]"}),
            Crossjoin(Except([Item].[Season Year].members, {"All member of [Item].[Season Year]"}), [Item].[Count].children)
        )
    ) ON ROWS
FROM Cube
