Complicated join in Spark: rdd elements have many key-value pairs

I'm new to Spark and trying to find a way to integrate information from one rdd into another, but their structures don't lend themselves to a standard join.
I have one rdd of this format:
[{a:a1, b:b1, c:[1,2,3,4], d:d1},
{a:a2, b:b2, c:[5,6,7,8], d:d2}]
and another of this format:
[{1:x1},{2:x2},{3:x3},{4:x4},{5:x5},{6:x6},{7:x7},{8:x8}]
I want to match the values in the second rdd to their keys in the first rdd (which are in a list value in the c key). I know how to manipulate them once they're there, so I'm not too concerned about the final output, but I'd maybe like to see something like this:
[{a:a1, b:b1, c:[1,2,3,4],c0: [x1,x2,x3,x4], d:d1},
{a:a2, b:b2, c:[5,6,7,8],c0: [x5,x6,x7,x8], d:d2}]
or this:
[{a:a1, b:b1, c:[(1,x1),(2,x2),(3,x3),(4,x4)], d:d1},
{a:a2, b:b2, c:[(5,x5),(6,x6),(7,x7),(8,x8)], d:d2}]
or anything else that can match the keys in the second rdd with the values in the first. I considered making the second rdd into a dictionary, which I know how to work with, but I just think my data is too large for that.
Thank you so much, I really appreciate it.

A join after a flatMap, or a plain cartesian, causes too many shuffles.
One possible solution is to use cartesian after a groupBy with a HashPartitioner.
(Sorry, this is Scala code.)
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// given RDDs of these shapes:
val rdd0: RDD[(String, String, Seq[Int], String)]
val rdd1: RDD[(Int, String)]

val partitioner = new HashPartitioner(rdd0.partitions.size)

// here is the point!
val grouped = rdd1.groupBy(partitioner.getPartition(_))

val result = rdd0.cartesian(grouped).map { case (left, (_, right)) =>
  // build a lookup table from this chunk of rdd1 and pick out the keys listed in c
  val map = right.toMap
  (left._1, left._2, left._4) -> left._3.flatMap(v => map.get(v).map(v -> _))
}.groupByKey().map { case (key, value) =>
  // merge the matches found across all chunks back into one record
  (key._1, key._2, value.flatten.toSeq, key._3)
}
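Since the question is in PySpark, here is a rough Python sketch of the same idea (this translation is mine, not part of the answer); records and lookups stand in for the question's two rdds, and the modulo-based bucketing is an illustrative stand-in for the HashPartitioner grouping:
# Rough PySpark sketch of the approach above (names are illustrative).
# records: rdd of dicts like {'a': 'a1', 'b': 'b1', 'c': [1, 2, 3, 4], 'd': 'd1'}
# lookups: rdd of (key, value) pairs like (1, 'x1')
num_chunks = records.getNumPartitions()

# bucket the lookup pairs into a bounded number of chunks
chunks = lookups.groupBy(lambda kv: kv[0] % num_chunks)

def match(record_and_chunk):
    record, (_, chunk) = record_and_chunk
    table = dict(chunk)
    pairs = [(k, table[k]) for k in record['c'] if k in table]
    return ((record['a'], record['b'], record['d']), pairs)

result = (records.cartesian(chunks)
                 .map(match)
                 .groupByKey()
                 .mapValues(lambda seqs: [p for seq in seqs for p in seq]))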

I will assume that rdd1 is the input containing {a:a1, b:b1, c:[1,2,3,4], d:d1} and rdd2 has tuples [(1, x1), (2, x2), (3, x3), (4, x4), (5, x5), (6, x6), (7, x7), (8, x8)]. I will also assume that all values in the "c" field in rdd1 can be found in rdd2. If not, you need to change some of the code below.
I sometimes have to solve this type of problem. If rdd2 is small enough, I typically do a map-side join, where I first broadcast the object and then do a simple lookup.
def augment_rdd1(line, lookup):
    c0 = []
    for key in line['c']:
        c0.append(lookup.value[key])
    return c0

lookup = sc.broadcast(dict(rdd2.collect()))
output = rdd1.map(lambda line: (line, augment_rdd1(line, lookup)))
If rdd2 is too large to be broadcasted, what I normally do is use a flatMap to map every row of rdd1 to as many rows as there are elements in the "c" field, e.g. {a:a1, b:b1, c:[1,2,3,4], d:d1} would be mapped to
(1, {a:a1, b:b1, c:[1,2,3,4], d:d1})
(2, {a:a1, b:b1, c:[1,2,3,4], d:d1})
(3, {a:a1, b:b1, c:[1,2,3,4], d:d1})
(4, {a:a1, b:b1, c:[1,2,3,4], d:d1})
The flatMap is:
flat_rdd1 = rdd1.flatMap(lambda line: [(key, line) for key in line['c']])
Then, I would join with rdd2 to get an RDD which has:
({a:a1, b:b1, c:[1,2,3,4], d:d1}, x1)
({a:a1, b:b1, c:[1,2,3,4], d:d1}, x2)
({a:a1, b:b1, c:[1,2,3,4], d:d1}, x3)
({a:a1, b:b1, c:[1,2,3,4], d:d1}, x4)
The join is the following. If rdd2 still contains single-entry dicts as in the question, first flatten them into (key, value) tuples; otherwise join with rdd2 directly:
rdd2_tuple = rdd2.flatMap(lambda line: line.items())
joined_rdd = flat_rdd1.join(rdd2_tuple).map(lambda x: x[1])
Finally, all you need to do is a groupByKey to get ({a:a1, b:b1, c:[1,2,3,4], d:d1}, [x1, x2, x3, x4]):
result = joined_rdd.groupByKey()
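One caveat not covered in the answer: Python dicts (and the list under "c") are not hashable, so using the whole record as the grouping key will fail at shuffle time. A minimal workaround, assuming the "a" field uniquely identifies each record (an assumption, not stated in the question), is to carry that id through the join and reattach the record afterwards:
# Sketch only; assumes 'a' is a unique, hashable id for each record.
flat_rdd1 = rdd1.flatMap(lambda line: [(key, line['a']) for key in line['c']])
# (c_key, a_id) joined with (c_key, x_value) -> (c_key, (a_id, x_value))
joined = flat_rdd1.join(rdd2_tuple).map(lambda kv: kv[1])   # (a_id, x_value)
grouped = joined.groupByKey()                               # (a_id, [x1, x2, ...])
# reattach the full record, keyed by 'a'
result = rdd1.map(lambda line: (line['a'], line)).join(grouped)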

Related

Z3 - how to count matches?

I have a finite set of pairs of type (int a, int b). The exact values of the pairs are explicitly present in the knowledge base. For example it could be represented by a function (int a, int b) -> (bool exists) which is fully defined on a finite domain.
I would like to write a function f with signature (int b) -> (int count), representing the number of pairs containing the specified b value as its second member. I would like to do this in z3 python, though it would also be useful to know how to do this in the z3 language
For example, my pairs could be:
(0, 0)
(0, 1)
(1, 1)
(1, 2)
(2, 1)
then f(0) = 1, f(1) = 3, f(2) = 1
This is a bit of an odd thing to do in z3: If the exact values of the pairs are in your knowledge base, then why do you need an SMT solver? You can just search and count using your regular programming techniques, whichever language you are in.
But perhaps you have some other constraints that come into play, and want a generic answer. Here's how one would code this problem in z3py:
from z3 import *

pairs = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 1)]

def count(snd):
    return sum([If(snd == p[1], 1, 0) for p in pairs])

s = Solver()

searchFor = Int('searchFor')
result = Int('result')

s.add(Or(*[searchFor == d[0] for d in pairs]))
s.add(result == count(searchFor))

while s.check() == sat:
    m = s.model()
    print("f(" + str(m[searchFor]) + ") = " + str(m[result]))
    s.add(searchFor != m[searchFor])
When run, this prints:
f(0) = 1
f(1) = 3
f(2) = 1
as you predicted.
Again; if your pairs are exactly known (i.e., they are concrete numbers), don't use z3 for this problem: Simply write a program to count as needed. If the database values, however, are not necessarily concrete but have other constraints, then above would be the way to go.
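For reference (a sketch, not part of the original answer), the "just count it" route in plain Python is only a few lines when the pairs are concrete:
from collections import Counter

pairs = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 1)]
counts = Counter(b for _, b in pairs)        # count pairs by their second member
print(counts[0], counts[1], counts[2])       # prints: 1 3 1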
To find out how this is coded in SMTLib (the native language z3 speaks), you can insert print(s.sexpr()) in the program before the while loop starts. That's one way. Of course, if you were writing this by hand, you might want to code it differently in SMTLib; but I'd strongly recommend sticking to higher-level languages instead of SMTLib as it tends to be hard to read/write for anyone except machines.

z3py: restricting solution to a set of values

I am new to Z3-solver python. I am trying to define a list and confine all my outputs to that list for a simple operation like xor.
My code:
b=Solver()
ls=[1,2,3,4,5] #my list
s1=BitVec('s1',32)
s2=BitVec('s2',32)
x=b.check(s1^s2==1, s1 in ls, s2 in ls) #s1 and s2 belongs to the list, however, this is not the correct way
if x==sat: print(b.model().eval)
The check function doesn't work like that.
Can anyone please help me figure out how to do this a different way?
Expected answer: s1=2, s2=3, since 2 xor 3 = 1 and both s1 and s2 belong to ls=[1,2,3,4,5].
The easiest way to do this would be to define a function that checks if a given argument is in a list provided. Something like:
from z3 import *
def oneOf(x, lst):
    return Or([x == i for i in lst])

s1 = BitVec('s1', 32)
s2 = BitVec('s2', 32)

s = Solver()
ls = [1, 2, 3, 4, 5]

s.add(oneOf(s1, ls))
s.add(oneOf(s2, ls))
s.add(s1 ^ s2 == 1)

print(s.check())
print(s.model())
When I run this, I get:
sat
[s2 = 2, s1 = 3]
which I believe is what you're after.
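As a follow-up (not part of the original answer), if you want every satisfying (s1, s2) pair rather than a single model, you can reuse the all-solutions loop from the previous answer; a sketch, continuing from the snippet above:
# Keep asking for models, blocking each one after it is printed.
while s.check() == sat:
    m = s.model()
    print(m[s1], m[s2])
    s.add(Or(s1 != m[s1], s2 != m[s2]))  # rule out this particular pair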

How do Racket streams work in this case?

I am currently learning Racket (just for fun) and I stumbled upon this example:
(define doubles
  (stream-cons
   1
   (stream-map
    (lambda (x)
      (begin
        (display "map applied to: ")
        (display x)
        (newline)
        (* x 2)))
    doubles)))
It produces 1 2 4 8 16 ...
I do not quite understand how it works.
So it creates 1 as the first element; when I call (stream-ref doubles 1) it creates the second element, which is obviously 2.
Then I call (stream-ref doubles 2), which should force creation of the third element, so it calls stream-map on a stream that already has 2 elements – (1 2) – so it should produce (2 4) and append that result to the stream.
Why is stream-map always applied only to the most recently created element? How does it work?
Thank you for your help!
This is a standard trick that makes it possible for lazy streams to be defined in terms of their previous element. Consider a stream as an infinite sequence of values:
s = x0, x1, x2, ...
Now, when you map over a stream, you provide a function and produce a new stream with the function applied to each element of the stream:
map(f, s) = f(x0), f(x1), f(x2), ...
But what happens when a stream is defined in terms of a mapping over itself? Well, if we have a stream s = 1, map(f, s), we can expand that definition:
s = 1, map(f, s)
= 1, f(x0), f(x1), f(x2), ...
Now, when we actually go to evaluate the second element of the stream, f(x0), then x0 is clearly 1, since we defined the first element of the stream to be 1. But when we go to evaluate the third element of the stream, f(x1), we need to know x1. Fortunately, we just evaluated x1, since it is f(x0)! This means we can “unfold” the sequence one element at a time, where each element is defined in terms of the previous one:
f(x) = x * 2
s = 1, map(f, s)
= 1, f(x0), f(x1), f(x2), ...
= 1, f(1), f(x1), f(x2), ...
= 1, 2, f(x1), f(x2), ...
= 1, 2, f(2), f(x2), ...
= 1, 2, 4, f(x2), ...
= 1, 2, 4, f(4), ...
= 1, 2, 4, 8, ...
This knot-tying works because streams are evaluated lazily, so each value is computed on-demand, left-to-right. Therefore, each previous element has been computed by the time the subsequent one is demanded, and the self-reference doesn’t cause any problems.
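For comparison only (not part of the original answer), the same knot-tying idea can be mimicked in Python with a self-referential generator; unlike Racket streams, this version does not memoize, so it recomputes earlier elements on each expansion:
from itertools import islice

def doubles():
    # first element is 1; the rest is the stream itself with every element doubled
    yield 1
    for x in doubles():
        yield x * 2

print(list(islice(doubles(), 5)))  # [1, 2, 4, 8, 16]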

Difference between `join` and `union` followed by `groupByKey` in Spark?

I cannot find a good reason why:
anRDD.join(anotherRDD)
should be different from:
anRDD.union(anotherRDD).groupByKey()
But, the latter gives me an error and the former doesn't. I can provide an example if absolutely needed, but I'd like to know from the perspective of functional abstraction. No one I've asked can give me a good explanation of this.
Here are some points that I will illustrate with some code below:
join works with two rdds, each consisting of key-value pairs, whose keys need to be matched. The value types of the two rdds need not match. The resulting rdd will always have entries of type (Key, (Value1, Value2)).
anRDD.union(anotherRDD).groupByKey() will produce an error if anRDD and anotherRDD have different value types; it will not produce an error if both keys and values have the same types. The result will have entries of type (Key, Iterable[Value]), where the Iterable need not have length 2 as in the case of join.
Example:
val rdd1 = sc.parallelize(Seq( ("a", 1) , ("b", 1)))
val rdd2 = sc.parallelize(Seq( ("a", 2) , ("b", 2)))
val rdd3 = sc.parallelize(Seq( ("a", 2.0) , ("b", 2.0))) // different Value type
val rdd4 = sc.parallelize(Seq( ("a", 1) , ("b", 1), ("a", 5) , ("b", 5)))
val rdd5 = sc.parallelize(Seq( ("a", 2) , ("b", 2), ("a", 5) , ("b", 5)))
produces the following:
scala> rdd1.join(rdd2)
res18: org.apache.spark.rdd.RDD[(String, (Int, Int))] = MapPartitionsRDD[77] at join at <console>:26
scala> rdd1.union(rdd2).groupByKey
res19: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[79] at groupByKey at <console>:26
scala> rdd1.union(rdd3).groupByKey
<console>:26: error: type mismatch;
found : org.apache.spark.rdd.RDD[(String, Double)]
required: org.apache.spark.rdd.RDD[(String, Int)]
rdd1.union(rdd3).groupByKey
Also notice the different result produced if you have repeated keys in your rdds:
scala> rdd4.union(rdd5).groupByKey.collect.mkString("\n")
res21: String =
(a,CompactBuffer(1, 5, 2, 5))
(b,CompactBuffer(1, 5, 2, 5))
scala> rdd4.join(rdd5).collect.mkString("\n")
res22: String =
(a,(1,2))
(a,(1,5))
(a,(5,2))
(a,(5,5))
(b,(1,2))
(b,(1,5))
(b,(5,2))
(b,(5,5))
Edit: The OP is using Python, not Scala. There is a difference in type safety between Python and Scala. Scala will catch things like a type mismatch between the two RDDs, as illustrated above; Python will not catch it right away, but will produce cryptic errors later on when you try to apply a method to objects of the wrong type. And remember, Spark is written in Scala with a Python API.
Indeed, I tried the OP's code from the comment, and in pyspark it works with simple actions like count(). However, it produces an error if you, for example, try to square each value (which you can do with integers, but not with strings).
Here is the data (note that I omitted the list; I only have the values 1 and 0):
B = [('b',1), ('c',0)]
C = [('b', 'bs'), ('c', 'cs')]
anRDD = sc.parallelize(B)
anotherRDD = sc.parallelize(C)
And here is the output:
>>> anRDD.join(anotherRDD).count()
2
>>> anRDD.union(anotherRDD).groupByKey().count()
2
>>> for y in anRDD.map(lambda (a, x): (a, x*x)).collect():
... print y
...
('b', 1)
('c', 0)
>>> for y in anRDD.union(anotherRDD).map(lambda (a, x): (a, x*x)).collect():
... print y
...
15/12/13 15:18:51 ERROR Executor: Exception in task 5.0 in stage 23.0 (TID 169)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
The former and the latter have different result sets:
Former:
(K, V).join(K, W) = (K, (V, W))
The former result is an equi-join; the SQL analogy is:
anRDD.K = anotherRDD.K
Latter:
The result not only includes the equi-join matches but also keeps the non-matched part of anRDD and the non-matched part of anotherRDD.
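A small PySpark sketch of that last point (illustrative data, not from the question, assuming a SparkContext sc as in the snippets above): a key present on only one side is dropped by join but kept by union followed by groupByKey:
left  = sc.parallelize([('a', 1), ('c', 3)])     # 'c' only on the left
right = sc.parallelize([('a', 10), ('d', 40)])   # 'd' only on the right

print(left.join(right).collect())
# [('a', (1, 10))]                            -- equi-join: only matching keys survive

print(sorted(left.union(right).groupByKey().mapValues(list).collect()))
# [('a', [1, 10]), ('c', [3]), ('d', [40])]   -- unmatched keys are kept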

Cypher: analog of `sort -u` to merge 2 collections?

Suppose I have a node with a collection in a property, say
START x = node(17) SET x.c = [ 4, 6, 2, 3, 7, 9, 11 ];
and somewhere (i.e. from .csv file) I get another collection of values, say
c1 = [ 11, 4, 5, 8, 1, 9 ]
I'm treating my collections as just sets; the order of elements does not matter. What I need is to merge x.c with c1 with some magic operation so that the resulting x.c will contain only the distinct elements from both. The following idea comes to mind (yet untested):
LOAD CSV FROM "file:///tmp/additives.csv" as row
START x=node(TOINT(row[0]))
MATCH c1 = [ elem IN SPLIT(row[1], ':') | TOINT(elem) ]
SET
x.c = [ newxc IN x.c + c1 WHERE (newx IN x.c AND newx IN c1) ];
This won't work; it would give an intersection, not a collection of distinct items.
More RTFM gives another idea: use REDUCE()? But how?
How could Cypher be extended with a new built-in function UNIQUE() which accepts a collection and returns it cleaned of duplicates?
UPD: It seems the FILTER() function is close, but it's an intersection again :(
x.c = FILTER( newxc IN x.c + c1 WHERE (newx IN x.c AND newx IN c1) )
WBR,
Andrii
How about something like this...
with [1,2,3] as a1
, [3,4,5] as a2
with a1 + a2 as all
unwind all as a
return collect(distinct a) as unique
Add two collections and return the collection of distinct elements.
Dec 15, 2014 - here is an update to my answer...
I started with a node in the neo4j database...
//create a node in the DB with a collection of values on it
create (n:Node {name:"Node 01",values:[4,6,2,3,7,9,11]})
return n
I created a csv sample file with two columns...
Name,Coll
"Node 01","11,4,5,8,1,9"
I created a LOAD CSV statement...
LOAD CSV
WITH HEADERS FROM "file:///c:/Users/db/projects/coll-merge/load_csv_file.csv" as row
// find the matching node
MATCH (x:Node)
WHERE x.name = row.Name
// merge the collections
WITH x.values + split(row.Coll,',') AS combo, x
// process the individual values
UNWIND combo AS value
// use toInt as the values from the csv come in as string
// may be a better way around this but i am a little short on time
WITH toInt(value) AS value, x
// might as well sort 'em so they are all purdy
ORDER BY value
WITH collect(distinct value) AS values, x
SET x.values = values
You could use reduce like this:
with [1,2,3] as a, [3,4,5] as b
return reduce(r = [], x in a + b | case when x in r then r else r + [x] end)
Since Neo4j 3.0, with APOC Procedures you can easily solve this with apoc.coll.union(). In 3.1+ it's a function, and can be used like this:
...
WITH apoc.coll.union(list1, list2) as unionedList
...
