Spark dataframe reduceByKey - join

I am using Spark 1.5/1.6 and want to do a reduceByKey-style operation on a DataFrame without converting the df to an RDD.
Each row looks like this, and I have multiple rows for each id1:
id1, id2, score, time
I want to have something like:
id1, [ (id21, score21, time21), (id22, score22, time22), (id23, score23, time23) ]
So, for each "id1", I want all of its records in a list.
By the way, the reason I don't want to convert the df to an RDD is that I have to join this (reduced) dataframe to another dataframe, and I am re-partitioning on the join key, which makes the join faster. I guess the same cannot be done with an RDD.
Any help will be appreciated.

To simply preserve the partitioning already achieved, re-use the parent RDD's partitioner in the reduceByKey invocation:
val rdd = df.rdd // rows must be mapped to (key, value) pairs before reduceByKey
val parentRdd = rdd.dependencies(0).rdd // Assuming first parent has the
// desired partitioning: adjust as needed
val parentPartitioner = parentRdd.partitioner.get // partitioner is an Option; .get fails if none is set
val optimizedReducedRdd = rdd.reduceByKey(parentPartitioner, reduceFn)
If you were to not specify the partitioner, as follows:
df.rdd.reduceByKey(reduceFn) // This is non-optimized: uses full shuffle
then the behavior you noted would occur, i.e. a full shuffle. That is because the HashPartitioner would be used instead.


Python vectorizing a dataframe lookup table

I have two dataframes. One is a lookup table consisting of key/value pairs. The other is my main dataframe. The main dataframe has many more records than the lookup table. I need to construct a 'key' from existing columns in my main dataframe and then look up the value matching that key in my lookup table. Here they are:
lk = pd.DataFrame( { 'key': ['key10', 'key9'],'value': [100, 90]})
lk.set_index('key', inplace=True)
date_today =
df = pd.DataFrame({ 'date1':[date_today, date_today,date_today],
'date2':[None, date_today, None],
'keyed_value': [0,0,0]})
This is how I get a value:
df['constructed'] = "key" + df['month'].astype('str')
def getKeyValue(lk, k):
    return lk.loc[k, 'value']
print(getKeyValue(lk, df['constructed']))
Here are my issues:
1) I don't want to use iteration or apply methods. My actual data is over 2 million rows and 200 columns. It was really slow (over 2 minutes) with apply. So I opted for an inner join, hence the need to create a new 'constructed' column. After the join I drop the 'constructed' column. The join has helped by bringing execution time down to 48 seconds, but I am hoping there is a faster way.
2) How do I vectorize this? I don't even know how to approach it. Is it even possible? I tried this but just got an error:
df['keyed_values'] = getKeyValue(lk, df['constructed'])
Any help or pointers are much appreciated.
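One possible way to vectorize this, as a minimal sketch rather than a definitive answer: because lk is indexed by key, Series.map can translate the whole constructed column in one pass without a join or apply. The 'month' column and its values below are invented for illustration (the question's frame does not show them), and the sketch assumes the keys in lk are unique.

```python
import pandas as pd

# Lookup table indexed by key, as in the question.
lk = pd.DataFrame({'key': ['key10', 'key9'], 'value': [100, 90]}).set_index('key')

# Hypothetical main frame: the 'month' values are made up for this example.
df = pd.DataFrame({'month': [10, 9, 10, 8]})

# Build the key with vectorized string concatenation (no temporary column needed).
constructed = 'key' + df['month'].astype(str)

# Series.map with a Series argument looks each element up in that Series' index.
# Keys that are absent from the lookup table come back as NaN.
df['keyed_value'] = constructed.map(lk['value'])

print(df)
```

Rows whose key is missing from lk end up as NaN here, whereas an inner join would drop them, so a dropna or fillna step may be needed depending on which behavior you want.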

SPARK - Joining two data streams - maintenance of cache

It is evident that the out-of-the-box join capability in Spark Streaming does not cover a lot of real-life use cases, because it joins only the data contained in the micro-batch RDDs.
The use case is to join data from two Kafka streams, enrich each object in stream1 with its corresponding object in stream2 in Spark, and save it to HBase.
The implementation would:
1. maintain a dataset in memory of objects from stream2, adding or replacing objects as and when they are received;
2. for every element in stream1, access the cache to find a matching object from stream2, then save to HBase if a match is found or put it back on the Kafka stream if not.
This question explores Spark Streaming and its API to find a way to implement the above.
You can join the incoming RDDs to other RDDs -- not just the ones in that micro-batch. Basically you keep a "running total" RDD that you fill like this:
var globalRDD1: RDD[...] = sc.emptyRDD
var globalRDD2: RDD[...] = sc.emptyRDD
dstream1.foreachRDD(rdd => if (!rdd.isEmpty) globalRDD1 = globalRDD1.union(rdd))
dstream2.foreachRDD(rdd => if (!rdd.isEmpty) {
  globalRDD2 = globalRDD2.union(rdd)
  globalRDD1.join(globalRDD2).foreach(...) // etc, etc
})
A good start would be to look into mapWithState. This is a more efficient replacement for updateStateByKey. These are defined on PairDStreamFunctions, so assuming your objects of type V in stream2 are identified by some key of type K, your first point would go like this:
def stream2: DStream[(K, V)] = ???
def maintainStream2Objects(key: K, value: Option[V], state: State[V]): (K, V) = {
  value.foreach(v => state.update(v)) // remember the latest value seen for this key
  (key, state.get())
}
val spec = StateSpec.function(maintainStream2Objects)
val stream2State = stream2.mapWithState(spec)
stream2State is now a stream where each batch contains the (K, V) pairs with the latest value seen for each key. You can do a join on this stream and stream1 to perform the further logic for your second point.
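The two points from the question can be sketched independently of Spark. The following is plain Python rather than Spark API, with hypothetical names and sample records: a dict stands in for the per-key state, and the two returned lists stand in for "save to HBase" and "put back on the Kafka stream".

```python
from typing import Dict, Iterable, List, Tuple

def update_cache(cache: Dict[str, str], batch2: Iterable[Tuple[str, str]]) -> None:
    """Add or replace stream2 objects, keeping the latest value per key."""
    for key, value in batch2:
        cache[key] = value

def enrich_batch(
    cache: Dict[str, str], batch1: Iterable[Tuple[str, str]]
) -> Tuple[List[Tuple[str, str, str]], List[Tuple[str, str]]]:
    """Split a stream1 batch into enriched records and unmatched records."""
    enriched, retry = [], []
    for key, payload in batch1:
        if key in cache:
            enriched.append((key, payload, cache[key]))  # would be saved to HBase
        else:
            retry.append((key, payload))  # would be put back on the Kafka stream
    return enriched, retry

cache: Dict[str, str] = {}
update_cache(cache, [("k1", "profile-v1"), ("k2", "profile-v1")])
update_cache(cache, [("k1", "profile-v2")])  # a newer object replaces the old one

enriched, retry = enrich_batch(cache, [("k1", "click"), ("k3", "view")])
print(enriched)
print(retry)
```

In a real job the cache would have to live in Spark-managed state (as with mapWithState above) rather than in a driver-side dict, so that it is partitioned and checkpointed along with the stream.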

How do I remove rows of an RDD whose key is not in another RDD?

Let's say I have a PairRDD, students (id, name). I would like to only keep rows where id is in another RDD, activeStudents (id).
The solution I have is to create a PairRDD from activeStudents, (id, id), and then do a join with students.
Is there a more elegant way of doing this?
That's a pretty good solution to start with. If activeStudents is small enough, you could collect the ids as a map and then filter on id presence (this avoids having to do a shuffle).
Much like you thought, you can do an outer join if both RDDs contain keys and values.
val students: RDD[(Long, String)] = ???
val activeStudents: RDD[Long] = ???
val activeMap: RDD[(Long, Unit)] = activeStudents.map(id => id -> ())
val activeWithName: RDD[(Long, String)] =
  students.leftOuterJoin(activeMap).flatMapValues {
    case (name, Some(())) => Some(name) // id is active: keep the row
    case (name, None) => None           // id is not active: drop the row
  }
If you don't have to join those two data sets then you should definitely avoid it.
I had a similar problem recently and solved it with a broadcast Set, which I used in a UDF to check whether each row (or rather, the value from one of its columns) is in that Set. That UDF is then used as the basis for the filter transformation.
More here: whats-the-most-efficient-way-to-filter-a-dataframe.
Hope this helps. Ask if it's not clear.
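The set-based filter suggested in the answers above boils down to a membership test per row. Here is a minimal plain-Python sketch of that idea, not Spark code; the sample students and ids are made up, and the set stands in for the collected (and broadcast) ids.

```python
from typing import List, Set, Tuple

# Hypothetical sample data standing in for the students and activeStudents RDDs.
students: List[Tuple[int, str]] = [(1, "Ann"), (2, "Bob"), (3, "Cid")]
active_ids: Set[int] = {1, 3}

def keep_active(rows: List[Tuple[int, str]], ids: Set[int]) -> List[Tuple[int, str]]:
    """Keep only rows whose id is in the set; each check is a hash lookup."""
    return [(sid, name) for sid, name in rows if sid in ids]

active_with_name = keep_active(students, active_ids)
print(active_with_name)
```

This only pays off when the id set fits comfortably in memory on every executor; otherwise the join remains the safer choice.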

Is it possible to read a message from a PubSub and separate its data in different elements of a PCollection<String>? If so, how?

Now, I have the below code:
PCollection<String> input_data =
Looks like you want to read some messages from pubsub and convert each of them to multiple parts by splitting a message on space characters, and then feed the parts to the rest of your pipeline. No special configuration of PubsubIO is needed, because it's not a "reading data" problem - it's a "transforming data you have already read" problem - you simply need to insert a ParDo which takes your "composite" record and breaks it down in the way you want, e.g.:
PCollection<String> input_data =
.apply(ParDo.of(new DoFn<String, String>() {
  public void processElement(ProcessContext c) {
    String composite = c.element();
    // Emit each space-separated part as its own element.
    for (String part : composite.split(" ")) {
      c.output(part);
    }
  }
}));
I take it you mean that the data you want is present in different elements of the PCollection and want to extract and group it somehow.
A possible approach is to write a DoFn function that processes each String in the PCollection. You output a key value pair for each piece of data you want to group. You can then use the GroupByKey transform to group all the relevant data together.
For example you have the following messages from pubsub in your PCollection:
User 1234 bought item A
User 1234 bought item B
The DoFn function will output a key-value pair with the user id as key and the item bought as value: (<1234, A>, <1234, B>).
Using the GroupByKey transform you group the two values together in one element. You can then perform further processing on that element.
This is a very common pattern in big data called MapReduce.
You can output an Iterable<A> and then use Flatten to squash it. Unsurprisingly, this is termed flatMap in many next-gen data processing platforms, cf. Spark and Flink.
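Both ideas from these answers, splitting one message into several elements and emitting key-value pairs for grouping, can be sketched in plain Python. This is not Dataflow code; the helper names are hypothetical, and the third message is invented to show a second key.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

messages = [
    "User 1234 bought item A",
    "User 1234 bought item B",
    "User 77 bought item C",  # made-up extra message
]

def split_parts(message: str) -> List[str]:
    """One input element becomes many output elements (the flatMap step)."""
    return message.split(" ")

def to_key_value(message: str) -> Tuple[str, str]:
    """Emit (user id, item), assuming the 'User <id> bought item <x>' layout."""
    parts = message.split(" ")
    return parts[1], parts[-1]

# Flattened view: every word of every message is its own element.
parts = [part for message in messages for part in split_parts(message)]

# Grouped view: all items collected under their user id (the GroupByKey step).
grouped: Dict[str, List[str]] = defaultdict(list)
for message in messages:
    user, item = to_key_value(message)
    grouped[user].append(item)

print(parts[:5])
print(dict(grouped))
```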

Is it possible to make a nested FOREACH without COGROUP in PigLatin?

I want to use the FOREACH like:
res = CROSS a, b;
-- some processing
By this I mean to make for each element of a a cross-product with all the elements of b, then perform some custom filtering and return tuples.
Custom filtering: res_filtered = FILTER res BY ...;
GENERATE res_filtered.
How can I do this with a nested CROSS, no more and no less, inside a FOREACH loop without a prior GROUP or COGROUP?
Depending on the specifics of your filtering, you may be able to design a limited set of disjoint classes of elements in a and b, and then JOIN on those. For example:
If your filtering rules are
if a_attr starts with "Foo" and b is 4, accept
if a_attr starts with "Bar" and b is greater than 17, accept
if a_attr begins with a letter in [m-z] and b is less than 0, accept
otherwise, reject
Then you can write a UDF that will return 1 for items satisfying the first rule, 2 for the second, 3 for the third, and NULL otherwise. Your CROSS/FILTER then becomes
res = JOIN a BY myUDF(a), b BY myUDF(b);
Pig drops null values in JOINs, so only pairs satisfying your filtering criteria will be passed.
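To make the class-based join concrete, here is a small plain-Python sketch of the same idea, not Pig. The two classifier functions play the role of the UDF for the three example rules, and the sample values are made up. Items classified as None never take part in the join, mirroring how Pig drops null join keys.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

def classify_a(a_attr: str) -> Optional[int]:
    """Class of an element of a under the three example rules, else None."""
    if a_attr.startswith("Foo"):
        return 1
    if a_attr.startswith("Bar"):
        return 2
    if a_attr[:1] and "m" <= a_attr[0] <= "z":
        return 3
    return None

def classify_b(b: int) -> Optional[int]:
    """Class of an element of b under the three example rules, else None."""
    if b == 4:
        return 1
    if b > 17:
        return 2
    if b < 0:
        return 3
    return None

def join_on_class(a_items: List[str], b_items: List[int]) -> List[Tuple[str, int]]:
    """Equi-join on the class id instead of CROSS followed by FILTER."""
    b_by_class: Dict[int, List[int]] = defaultdict(list)
    for b in b_items:
        cls = classify_b(b)
        if cls is not None:
            b_by_class[cls].append(b)
    result = []
    for a in a_items:
        cls = classify_a(a)
        if cls is not None:
            result.extend((a, b) for b in b_by_class.get(cls, []))
    return result

pairs = join_on_class(["Foobar", "Barley", "zebra", "Apple"], [4, 20, -3, 10])
print(pairs)
```

The trick works only because the classes are disjoint on both sides: each element gets at most one class id, so an equality join on that id reproduces the filter.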
CROSS generates a cross-product of all the tuples in each relation. So there is no need to have a nested FOREACH. Just do the CROSS and then FILTER:
a: {a_attr: chararray}
b: {b_attr: int}
crossed = CROSS a, b;
crossed: {a::a_attr: chararray,b::b_attr: int}
res = FILTER crossed BY ... -- your custom filtering
If the FILTER comes immediately after the CROSS, you should not run into excessive IO from the CROSS writing the entire cross-product to disk before filtering. Records that get filtered out will never be written at all.
