Apache Spark Moving Average - time-series

I have a huge file in HDFS containing time series data points (Yahoo stock prices).
I want to compute the moving average of the time series. How do I go about writing an Apache Spark job to do that?

You can use the sliding function from MLlib, which probably does the same thing as Daniel's answer. You will have to sort the data by time before using the sliding function.
import org.apache.spark.mllib.rdd.RDDFunctions._
sc.parallelize(1 to 100, 10)
  .sliding(3)
  .map(curSlice => curSlice.sum / curSlice.size)
  .collect()
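If your time series is keyed by a timestamp, sort it first; here is a minimal sketch, assuming a hypothetical RDD of (timestamp, price) pairs named quotes:
import org.apache.spark.mllib.rdd.RDDFunctions._
val quotes = sc.parallelize(Seq((3L, 10.0), (1L, 12.0), (2L, 11.0)))
quotes
  .sortByKey()               // order by timestamp
  .map(_._2)                 // keep only the price
  .sliding(3)                // windows of 3 consecutive prices
  .map(w => w.sum / w.size)  // average of each window
  .collect()                 // Array(11.0) for this tiny example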

Moving average is a tricky problem for Spark, or for any distributed system. When the data is spread across multiple machines, some time windows will cross partition boundaries. We have to copy the data at the start of each partition to the previous partition, so that calculating the moving average per partition gives complete coverage.
Here is a way to do this in Spark. The example data:
val ts = sc.parallelize(0 to 100, 10)
val window = 3
A simple partitioner that puts each row in the partition we specify by the key:
class StraightPartitioner(p: Int) extends org.apache.spark.Partitioner {
  def numPartitions = p
  def getPartition(key: Any) = key.asInstanceOf[Int]
}
Create the data with the first window - 1 rows of each partition copied to the previous partition:
val partitioned = ts.mapPartitionsWithIndex((i, p) => {
  val overlap = p.take(window - 1).toArray
  val spill = overlap.iterator.map((i - 1, _))    // copies keyed for the previous partition
  val keep = (overlap.iterator ++ p).map((i, _))  // this partition's own data, keyed for partition i
  if (i == 0) keep else keep ++ spill
}).partitionBy(new StraightPartitioner(ts.partitions.length)).values
Just calculate the moving window sums on each partition:
val movingAverage = partitioned.mapPartitions(p => {
  val sorted = p.toSeq.sorted
  val olds = sorted.iterator              // trails window - 1 elements behind news
  val news = sorted.iterator
  var sum = news.take(window - 1).sum     // seed the running sum with the first window - 1 values
  (olds zip news).map({ case (o, n) => {
    sum += n                              // add the value entering the window
    val v = sum
    sum -= o                              // drop the value leaving the window
    v
  }})
})
Because of the duplicate segments this will have no gaps in coverage.
scala> movingAverage.collect.sameElements(3 to 297 by 3)
res0: Boolean = true
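Note that, as the check above shows, movingAverage actually holds the window sums (multiples of 3 for this data); to turn them into true averages, divide by the window size:
val realAverage = movingAverage.map(_.toDouble / window)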

Spark 1.4 introduced window functions, which means that you can compute a moving average as follows; adjust the window with rowsBetween:
val schema = Seq("id", "cykle", "value")
val data = Seq(
(1, 1, 1),
(1, 2, 11),
(1, 3, 1),
(1, 4, 11),
(1, 5, 1),
(1, 6, 11),
(2, 1, 1),
(2, 2, 11),
(2, 3, 1),
(2, 4, 11),
(2, 5, 1),
(2, 6, 11)
)
val dft = sc.parallelize(data).toDF(schema: _*)
dft.select('*).show
// PARTITION BY id ORDER BY cykle ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING (5)
val w = Window.partitionBy("id").orderBy("cykle").rowsBetween(-2, 2)
val x = dft.select($"id",$"cykle",avg($"value").over(w))
x.show
Output (in Zeppelin):
schema: Seq[String] = List(id, cykle, value)
data: Seq[(Int, Int, Int)] = List((1,1,1), (1,2,11), (1,3,1), (1,4,11), (1,5,1), (1,6,11), (2,1,1), (2,2,11), (2,3,1), (2,4,11), (2,5,1), (2,6,11))
dft: org.apache.spark.sql.DataFrame = [id: int, cykle: int, value: int]
+---+-----+-----+
| id|cykle|value|
+---+-----+-----+
| 1| 1| 1|
| 1| 2| 11|
| 1| 3| 1|
| 1| 4| 11|
| 1| 5| 1|
| 1| 6| 11|
| 2| 1| 1|
| 2| 2| 11|
| 2| 3| 1|
| 2| 4| 11|
| 2| 5| 1|
| 2| 6| 11|
+---+-----+-----+
w: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@55cd666f
x: org.apache.spark.sql.DataFrame = [id: int, cykle: int, 'avg(value) WindowSpecDefinition ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING: double]
+---+-----+-------------------------------------------------------------------------+
| id|cykle|'avg(value) WindowSpecDefinition ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING|
+---+-----+-------------------------------------------------------------------------+
| 1| 1| 4.333333333333333|
| 1| 2| 6.0|
| 1| 3| 5.0|
| 1| 4| 7.0|
| 1| 5| 6.0|
| 1| 6| 7.666666666666667|
| 2| 1| 4.333333333333333|
| 2| 2| 6.0|
| 2| 3| 5.0|
| 2| 4| 7.0|
| 2| 5| 6.0|
| 2| 6| 7.666666666666667|
+---+-----+-------------------------------------------------------------------------+
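For completeness, here is a minimal sketch of what the snippet above assumes: the usual window-function imports plus the SQL implicits for toDF and the $ syntax (and note that on Spark 1.x, window functions generally require a HiveContext):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg
import sqlContext.implicits._   // enables toDF(...) and $"col"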

Related

Google Sheets - Return unique id with the maximum associated mark

Good day,
I'm trying to return the maximum mark of each student. If they fail the training, they can try a new attempt, and my summary sheet should include only unique emails with the highest mark obtained.
Example of data:
|    | A            | B     |
|----|--------------|-------|
| 1  | email        | score |
| 2  | abc#mail.com | 1     |
| 3  | abd#mail.com | 3     |
| 4  | abc#mail.com | 3     |
| 5  | abc#mail.com | 4     |
| 6  | abe#mail.com | 5     |
| 7  | abe#mail.com | 4     |
| 8  | abe#mail.com | 7     |
| 9  | jvr#mail.com | 1     |
| 10 | jvr#mail.com | 7     |
And I would like to return this table:
|   | D            | E     |
|---|--------------|-------|
| 1 | email        | score |
| 2 | abc#mail.com | 4     |
| 3 | abd#mail.com | 3     |
| 4 | abe#mail.com | 7     |
| 5 | jvr#mail.com | 7     |
Code used in COL D2:
=UNIQUE(A2:A,FALSE,FALSE)
Code used in COL E2:
=if(G2<>"", ARRAYFORMULA(VLOOKUP(G2,D2:E,2,false)),"")
Code used in COL E3:
=if(G3<>"", ARRAYFORMULA(VLOOKUP(G3,D3:E,2,false)),"")
Is there any way to optimize this?
In D2 paste
=UNIQUE(A2:A)
In E2 paste this formula
=MAX(TRANSPOSE(FILTER($B$2:$B, $A$2:$A=D2)))
use:
=SORTN(SORT(A2:B, 2, 0), 9^9, 2, 1, 1)
A2:B - the range
2 - sort by the 2nd column
0 - in descending order
9^9 - return all rows
2 - group by
1 - the first column
1 - in ascending order

How can I take n samples/events from all predicted clusters and return them in the form of a DataFrame in PySpark?

I'm following this tutorial and trying to pick/select the top n events (let's say 10 events/rows) after assigning the predicted clusters to the main df, then merge them and report the result in the form of a Spark DataFrame.
Let's say I have a main DataFrame df1 containing the 3 features below:
+-----+------------+----------+----------+
| id| x| y| z|
+-----+------------+----------+----------+
| row0| -6.0776997|-2.9096103|-1.5181729|
| row1| -1.0122601| 7.322841|-5.4424076|
| row2| -8.297007| 6.3228936| 1.1672047|
| row3| -3.5071216| 4.784812|-5.4449472|
| row4| -5.122823|-3.3220499|-0.5069805|
| row5| -2.4764006| 8.255791| 4.409478|
| row6| 7.3153954| -5.079449| -7.291215|
| row7| -2.0167463| 9.303454| 7.095179|
| row8| -0.2338185| -4.892681| 2.1228876|
| row9| 6.565442| -6.855994|-6.7983212|
|row10| -5.6902847|-6.4827404|-0.9246967|
|row11|-0.017986143| 2.7632365| -8.814824|
|row12| -6.9042625|-6.1491723|-3.5354295|
|row13| -10.389865| 9.537853| 0.674591|
|row14| 3.9688683|-6.0467844| -5.462389|
|row15| -7.337052|-3.7689247| -5.261122|
|row16| -8.991589| 8.738728| 3.864116|
|row17| -0.18098584| 5.482743| -4.900118|
|row18| 3.3193955|-6.3573766| -6.978025|
|row19| -2.0266335|-3.4171724|0.48218703|
+-----+------------+----------+----------+
Now I have information out of the clustering algorithm in the form of the DataFrame df2, as below:
print("==========================Short report==================================== ")
n_clusters = model.summary.k
#n_clusters
print("Number of predicted clusters: " + str(n_clusters))
cluster_Sizes = model.summary.clusterSizes
#cluster_Sizes
col = ['size']
df2 = pd.DataFrame(cluster_Sizes, columns=col).sort_values(by=['size'], ascending=True) #sorting
cluster_Sizes = df2["size"].unique()
print("Size of predicted clusters: " + str(cluster_Sizes))
clusterSizes
#==========================Short report====================================
#Number of predicted clusters: 10
#Size of predicted clusters: [ 486 496 504 529 985 998 999 1003 2000]
+----------+----+
|prediction|size|
+----------+----+
|         2| 486|
|         6| 496|
|         0| 504|
|         8| 529|
|         5| 985|
|         9| 998|
|         7| 999|
|         3|1003|
|         1|2000|
|         4|2000|
+----------+----+
So here, the index column holds the predicted cluster labels. I could assign the predicted cluster labels to the main DataFrame, but not the cluster sizes, as below:
+-----+----------+------------+----------+----------+
| id|prediction| x| y| z|
+-----+----------+------------+----------+----------+
| row0| 9| -6.0776997|-2.9096103|-1.5181729|
| row1| 4| -1.0122601| 7.322841|-5.4424076|
| row2| 1| -8.297007| 6.3228936| 1.1672047|
| row3| 4| -3.5071216| 4.784812|-5.4449472|
| row4| 3| -5.122823|-3.3220499|-0.5069805|
| row5| 1| -2.4764006| 8.255791| 4.409478|
| row6| 5| 7.3153954| -5.079449| -7.291215|
| row7| 1| -2.0167463| 9.303454| 7.095179|
| row8| 7| -0.2338185| -4.892681| 2.1228876|
| row9| 5| 6.565442| -6.855994|-6.7983212|
|row10| 3| -5.6902847|-6.4827404|-0.9246967|
|row11| 4|-0.017986143| 2.7632365| -8.814824|
|row12| 9| -6.9042625|-6.1491723|-3.5354295|
|row13| 1| -10.389865| 9.537853| 0.674591|
|row14| 2| 3.9688683|-6.0467844| -5.462389|
|row15| 9| -7.337052|-3.7689247| -5.261122|
|row16| 1| -8.991589| 8.738728| 3.864116|
|row17| 4| -0.18098584| 5.482743| -4.900118|
|row18| 2| 3.3193955|-6.3573766| -6.978025|
|row19| 7| -2.0266335|-3.4171724|0.48218703|
+-----+----------+------------+----------+----------+
Now I want to include/report the top n rows of each cluster in the form of a DataFrame via the following function. What I have tried so far is (multi-)conditional filtering:
print("==========================Short report==================================== ")
n_clusters = model.summary.k
#n_clusters
print("Number of predicted clusters: " + str(n_clusters))
cluster_Sizes = model.summary.clusterSizes
#cluster_Sizes
col = ['size']
clusterSizes = pd.DataFrame(cluster_Sizes, columns=col).sort_values(by=['size'], ascending=True) #sorting
cluster_Sizes = clusterSizes["size"].unique()
print("Size of predicted clusters: " + str(cluster_Sizes))
clusterSizes
from pyspark.sql.functions import max, min
def cls_report(df):
    x1 = df.select([min("x")])  # min of column x
    x2 = df.select([max("y")])  # max of column y
    x3 = df.select([max("z")])  # max of column z
    return x1, x2, x3
#pick top out clusters with minimum instances
km_1st_cls = clusterSizes.values[0]
km_2nd_cls = clusterSizes.values[1]
km_3rd_cls = clusterSizes.values[2]
print(km_1st_cls)
print(km_2nd_cls)
print(km_3rd_cls)
#F1 = cls_report(pddf_pred.filter(f"prediction == {km_1st_cls}"))[0]
F1 = cls_report(pddf_pred.filter(f"prediction == {km_1st_cls}"))[0].select("min(x)").rdd.flatMap(list).collect()[0]
F2 = cls_report(pddf_pred.filter(f"prediction == {km_2nd_cls}"))[0].select("min(x)").rdd.flatMap(list).collect()[0]
F3 = cls_report(pddf_pred.filter(f"prediction == {km_3rd_cls}"))[0].select("min(x)").rdd.flatMap(list).collect()[0]
L1 = cls_report(pddf_pred.filter(f"prediction == {km_1st_cls}"))[1].select("max(y)").rdd.flatMap(list).collect()[0]
L2 = cls_report(pddf_pred.filter(f"prediction == {km_2nd_cls}"))[1].select("max(y)").rdd.flatMap(list).collect()[0]
L3 = cls_report(pddf_pred.filter(f"prediction == {km_3rd_cls}"))[1].select("max(y)").rdd.flatMap(list).collect()[0]
T1 = cls_report(pddf_pred.filter(f"prediction == {km_1st_cls}"))[2].select("max(z)").rdd.flatMap(list).collect()[0]
T2 = cls_report(pddf_pred.filter(f"prediction == {km_2nd_cls}"))[2].select("max(z)").rdd.flatMap(list).collect()[0]
T3 = cls_report(pddf_pred.filter(f"prediction == {km_3rd_cls}"))[2].select("max(z)").rdd.flatMap(list).collect()[0]
print(F1)
print(F2)
print(F3)
print(L1)
print(L2)
print(L3)
print(T1)
print(T2)
print(T3)
df_anomaly_1st_cls = pddf_pred.filter(f"(prediction == {km_1st_cls})") \
    .filter(f"y == {L1}") \
    .filter(f"z == {T1}") \
    .filter(f"x == {F1}")
display(df_anomaly_1st_cls)
I know that with the KMeans algorithm in scikit-learn we could do the following, based on this post:
clusters = KMeans(n_clusters=5).fit(df)
df[clusters.labels_ == 0]
But we don't have access to such a labels_ attribute in Spark for a quick hack like this.
Is there any elegant way (maybe by defining a function) to reflect the results of any clustering algorithm on the main DataFrame for better reasoning?
Note: I'm not interested in hacking it by converting the Spark DataFrame into a pandas DataFrame using .toPandas().
Update: I might need a function that automates the multi-condition filtering: it takes as input the DataFrame, the number of top clusters (ranked by their instance counts), and the number of events/rows, and returns the filtered/selected results stacked together. A mock of the expected DataFrame follows:
def filtering(df, top_out_cluster=2, top_records=2):
# +-----+----------+------------+----------+----------+
# | id|prediction| x| y| z|
# +-----+----------+------------+----------+----------+
#1st top out cluster | row4| 3| -5.122823|-3.3220499|-0.5069805|
#conditions F1, L1, T1 |row10| 3| -5.6902847|-6.4827404|-0.9246967|
#2nd top out cluster | row8| 7| -0.2338185| -4.892681| 2.1228876|
#conditions F1, L1, T1 |row19| 7| -2.0266335|-3.4171724|0.48218703|
#3rd top out cluster |row18| 2| 3.3193955|-6.3573766| -6.978025|
#conditions F1, L1, T1 |row14| 2| 3.9688683|-6.0467844| -5.462389|
#1st top out cluster | row6| 5| 7.3153954| -5.079449| -7.291215|
#conditions F2, L2, T2 | row9| 5| 6.565442| -6.855994|-6.7983212|
#2nd top out cluster |row12| 9| -6.9042625|-6.1491723|-3.5354295|
#conditions F2, L2, T2 | row0| 9| -6.0776997|-2.9096103|-1.5181729|
#1st top out cluster | row1| 4| -1.0122601| 7.322841|-5.4424076|
#conditions F3, L3, T3 |row11| 4|-0.017986143| 2.7632365| -8.814824|
#2nd top out cluster |row13| 1| -10.389865| 9.537853| 0.674591|
#conditions F3, L3, T3 | row5| 1| -2.4764006| 8.255791| 4.409478|
# +-----+----------+------------+----------+----------+
As per your requirements discussed over Colab, the following is the code.
import pandas as pd
from pyspark.sql.window import Window
from pyspark.sql import functions as f

winspec = Window.partitionBy("prediction").orderBy("prediction")

def get_top_n_clusters(model, top_out_cluster: int):
    n_clusters = model.summary.k
    cluster_size = model.summary.clusterSizes
    if top_out_cluster > n_clusters:
        raise ValueError(f"top_out_cluster ({top_out_cluster}) cannot be greater than the number of clusters ({n_clusters})")
    col = ['size']
    cluster_size = pd.DataFrame(cluster_size, columns=col).sort_values(by=['size'], ascending=True)  # sorting
    return list(cluster_size.head(top_out_cluster).index)

def filtering(df, labels: list, top_records: int):
    winspec = Window.partitionBy("prediction").orderBy("prediction")
    return (
        df.filter(f.col("prediction").isin(labels))
        .withColumn("rowNum", f.row_number().over(winspec))
        .withColumn("minX", f.min(f.col("x")).over(winspec))
        .withColumn("maxY", f.max(f.col("y")).over(winspec))
        .withColumn("maxZ", f.max(f.col("z")).over(winspec))
        .filter(f.col("rowNum") <= top_records)
        .selectExpr("id", "prediction", "minX as x", "maxY as y", "maxZ as z")
    )
cluster_labels = get_top_n_clusters(model, top_out_cluster=4)
fdf = filtering(df_pred, labels=cluster_labels, top_records=1)
fdf.show()
+-----+----------+----------+----------+-----------+
| id|prediction| x| y| z|
+-----+----------+----------+----------+-----------+
|row15| 7|-10.505684|-1.6820424|-0.32242402|
| row4| 9| -9.426199| 0.5639291| 3.5664654|
| row0| 2| -8.317323|-1.3278837|-0.33906546|
|row14| 4|-0.9338185|-3.6411285| -3.7280529|
+-----+----------+----------+----------+-----------+

Group by using all columns from the left table after a join with replicated names in a PySpark DataFrame

I have a Spark DataFrame obtained by joining two tables. They share the column "name":
valuesA = [('A',1,5),('B',7,12),('C',3,6),('D',4,9)]
TableA = spark.createDataFrame(valuesA,['name','id', 'otherValue']).alias('ta')
valuesB = [('A',1),('A',4),('B',2),('B',8),('E',4)]
TableB = spark.createDataFrame(valuesB,['name','id']).alias('tb')
joined = TableA.join(TableB, TableA.name==TableB.name, 'left')
I would like to do something similar to the select joined.select('ta.*').show() but for groupBy; however, joined.groupBy('ta.*').count() raises an error.
How can I implement something like that without having to explicitly list the columns? joined.groupBy(TableA.columns).count() raises an issue because "name" is not unique.
As an alternative, how can I retrieve the columns with their proper alias from joined?
PS: Doing the join as joined = TableA.join(TableB, ['name'], 'left') is not a useful solution, because I have columns that are not used in the join condition but have the same name in tables A and B.
You can always use a list comprehension to get a list of column names for the groupBy:
aliasListTableA = ['ta.' + c for c in TableA.columns]
joined.groupBy(aliasListTableA).count().show()
Output:
+----+---+----------+-----+
|name| id|otherValue|count|
+----+---+----------+-----+
| B| 7| 12| 2|
| D| 4| 9| 1|
| C| 3| 6| 1|
| A| 1| 5| 2|
+----+---+----------+-----+
In general I try to avoid aliases, as they hide the origin of a column:
aliasListTableA = ['ta_' + c for c in TableA.columns]
aliasListTableB = ['tb_' + c for c in TableB.columns]
joined = joined.toDF(*(aliasListTableA + aliasListTableB))
joined.show()
Output:
+-------+-----+-------------+-------+-----+
|ta_name|ta_id|ta_otherValue|tb_name|tb_id|
+-------+-----+-------------+-------+-----+
| B| 7| 12| B| 2|
| B| 7| 12| B| 8|
| D| 4| 9| null| null|
| C| 3| 6| null| null|
| A| 1| 5| A| 1|
| A| 1| 5| A| 4|
+-------+-----+-------------+-------+-----+

How to translate a COUNTIFS formula into an ARRAYFORMULA to automatically insert the formula in each row

With my COUNTIFS formula in column C I want to auto-number (running total) all occurrences of an identical string in column A (e.g. Apple or Orange), but only if column B on the same row where the string appears is of a certain type: e.g. if column B says "Fruit", then column C should number all occurrences of the identical string in column A. For each new string of type "Fruit" the numbering should start over again.
The outcome should be like this:
+---+-----------+-------+---+--+
| | A | B | C | |
+---+-----------+-------+---+--+
| 1 | Apple | Fruit | 1 | |
| 2 | Apple | Fruit | 2 | |
| 3 | Mercedes | Car | 0 | |
| 4 | Mercedes | Car | 0 | |
| 5 | Orange | Fruit | 1 | |
| 6 | Orange | Fruit | 2 | |
| 7 | Apple | Fruit | 3 | |
+---+-----------+-------+---+--+
The formula in column C:
=COUNTIFS($A$1:$A1;A1;$B$1:$B1;"Fruit")
=COUNTIFS($A$1:$A2;A2;$B$1:$B2;"Fruit")
=COUNTIFS($A$1:$A3;A3;$B$1:$B3;"Fruit")
…and so on…
I want to translate this formula into an array formula and put this into the header so the formula will automatically expand.
No matter what I've tried it won't work.
Any help is truly appreciated!
Here's a link to a sheet: https://docs.google.com/spreadsheets/d/1lgbuLbTSnyKkqr33NdVuDEv5eoXFwatX1rgeF9YpIks/edit?usp=sharing
={"ARRAYFORMULA HERE"; ARRAYFORMULA(IF(LEN(B2:B), IF(B2:B="Fruit",
MMULT(N(ROW(B2:B)>=TRANSPOSE(ROW(B2:B))), N(B2:B="Fruit"))-
HLOOKUP(0, MMULT(N(ROW(B2:B)>TRANSPOSE(ROW(B2:B))), N(B2:B="Fruit")),
MATCH(VLOOKUP(ROW(B2:B), IF(N(B2:B<>B1:B), ROW(B2:B), ), 1, 1),
VLOOKUP(ROW(B2:B), IF(N(B2:B<>B1:B), ROW(B2:B), ), 1, 1), 0), 0), 0), ))}
demo spreadsheet
=ARRAYFORMULA(IF(LEN(B2:B), IF(B2:B="Fruit",
MMULT(N(ROW(B2:B)>=TRANSPOSE(ROW(B2:B))), N(B2:B="Fruit")), 0), ))

OpenCV Haar Classifier result table explanation

I am trying to create a Haar classifier to recognise objects; however, I can't seem to figure out what the results table produced at each training stage stands for.
E.g. 1
===== TRAINING 1-stage =====
<BEGIN
POS count : consumed 700 : 700
NEG count : acceptanceRatio 2500 : 0.452161
Precalculation time: 9
+----+---------+---------+
| N | HR | FA |
+----+---------+---------+
| 1| 1| 1|
+----+---------+---------+
| 2| 1| 1|
+----+---------+---------+
| 3| 1| 1|
+----+---------+---------+
| 4| 1| 1|
+----+---------+---------+
| 5| 1| 0.7432|
+----+---------+---------+
| 6| 1| 0.6312|
+----+---------+---------+
| 7| 1| 0.5112|
+----+---------+---------+
| 8| 1| 0.6104|
+----+---------+---------+
| 9| 1| 0.4488|
+----+---------+---------+
END>
E.g. 2
===== TRAINING 2-stage =====
<BEGIN
POS count : consumed 500 : 500
NEG count : acceptanceRatio 964 : 0.182992
Precalculation time: 49
+----+---------+---------+
| N | HR | FA |
+----+---------+---------+
| 1| 1| 1|
+----+---------+---------+
| 2| 1| 1|
+----+---------+---------+
I'm not sure what the N, HR, and FA are referring to in each of these cases. Could someone explain what they stand for and what they mean?
Searching for "HR" in the OpenCV source leads us to this file. Lines 1703-1707 inside CvCascadeBoost::isErrDesired print the table:
cout << "|"; cout.width(4); cout << right << weak->total;
cout << "|"; cout.width(9); cout << right << hitRate;
cout << "|"; cout.width(9); cout << right << falseAlarm;
cout << "|" << endl;
cout << "+----+---------+---------+" << endl;
So HR and FA stand for hit rate and false alarm, and N is the number of weak classifiers trained so far in the stage (weak->total). Conceptually: hitRate = the percentage of positive samples correctly classified as positive; falseAlarm = the percentage of negative samples incorrectly classified as positive.
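As a rough, language-agnostic sketch (illustration only, not OpenCV code), the two rates boil down to simple ratios over labelled samples; the names here are hypothetical:
// isPositive = sample really contains the object, predictedPositive = classifier says it does
case class Sample(isPositive: Boolean, predictedPositive: Boolean)
def rates(samples: Seq[Sample]): (Double, Double) = {
  val (pos, neg) = samples.partition(_.isPositive)
  val hitRate    = pos.count(_.predictedPositive).toDouble / pos.size   // HR
  val falseAlarm = neg.count(_.predictedPositive).toDouble / neg.size   // FA
  (hitRate, falseAlarm)
}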
Reading the code for CvCascadeBoost::train, we can see the following do-while loop:
cout << "+----+---------+---------+" << endl;
cout << "| N | HR | FA |" << endl;
cout << "+----+---------+---------+" << endl;
do
{
[...]
}
while( !isErrDesired() && (weak->total < params.weak_count) );
Just looking at this, and not knowing much about the specifics of boosting, we can make the educated guess that training keeps adding weak classifiers until the error is low enough (as measured by isErrDesired, i.e. by falseAlarm) or the maximum number of weak classifiers (params.weak_count) is reached.
