Spark: join key-tuple pairs into key-list value - join

I have many RDDs ( let say 4 ) of this kind: K,(v1,v2,..,vN) and I have to join them, so I simply run
r1.join(r2).join(r3).join(r4)
The result will be something like K,((v1,v2,..,vN),(v1,v2,...,vN)),(v1,v2,...,vN))... and so on. Basically, I will get a nested structure of tuples, one for each join operation.
I was wondering if there exists a way to tell Spark to output as result of the join a union of the values of each RDD. In other words, I would like to get something like:
K, [ v1,v2,..., vN,v1,v2,..., vN,v1,v2,..., v1,v2,...,vN ]

You could do a multi-join or you could save yourself from nested syntax and apply a version of cogroup instead. However, since cogroup() only allows you to group up to 4 RDD's you can kind of monkey patch it to group more. Below is an example of a multiCogroup() function:
def multiCogroup[K : ClassTag, V : ClassTag](numPartitions: Int, inputRDDs: RDD[(K, V)]*) : RDD[(K, Seq[V])] = {
val cg = new CoGroupedRDD[K](inputRDDs.toSeq, new HashPartitioner(numPartitions))
cg.mapValues { case iterables => iterables.foldLeft(Seq[V]())(_ ++ _.asInstanceOf[Iterable[V]].toSeq) }
}
Run on an example, you can see the following:
import org.apache.spark.rdd._
import org.apache.spark.HashPartitioner
import scala.reflect.ClassTag
val rdd1 = sc.parallelize(Seq(("a", 1),("b", 2),("c", 3),("d", 4)))
val rdd2 = sc.parallelize(Seq(("a", 4),("b", 3),("c", 2),("d", 1)))
val rdd3 = sc.parallelize(Seq(("c", 0),("d", 0),("e", 0)))
val rdd4 = sc.parallelize(Seq(("a", 5),("b", 5),("e", 5)))
val rdd5 = sc.parallelize(Seq(("b", -1),("c", -1),("d", -1)))
val combined = multiCogroup[String, Int](2, rdd1, rdd2, rdd3, rdd4, rdd5)
combined.foreach(println)
// (d,List(4, 1, 0, -1))
// (b,List(2, 3, 5, -1))
// (e,List(0, 5))
// (a,List(1, 4, 5))
// (c,List(3, 2, 0, -1))
A few things to note:
If your input RDD value types are non-uniform, you could umbrella the output type V into type a super type (e.g. Int and Long into Integral, String and Int into Any). This might not be recommendable though as it could cause some ambiguity issues later on in your program. In general, I think the best use case of this is when all input value types are the same.
I've defined the function to use a HashPartitioner with the number of partitions being the parameter numPartitions. It may make sense to tunnel your own Partitioner in by replacing the numPartitions argument. You can then pass the input partitioner directly to CoGroupedRDD[K](), similarly done in the implementation of cogroup.
I would probably apply some caution on using this method on large RDDs. Joins themselves can be kind of tricky depending on the size of the input data as well as the distribution of the key set. Expanding this to grouping multiple RDDs in a single cogroup could lead to similar memory issues quicker.

Related

What alternatives exist for representing sets in Z3?

Per this answer the Z3 set sort is implemented using arrays, which makes sense given the SetAdd and SetDel methods available in the API. It is also claimed here that if the array modification functions are never used, it's wasteful overhead to use arrays instead of uninterpreted functions. Given that, if my only uses of a set are to apply constraints with IsMember (either on individual values or as part of a quantification), is it a better idea to use an uninterpreted function mapping from the underlying element sort to booleans? So:
from z3 import *
s = Solver()
string_set = SetSort(StringSort())
x = String('x')
s.add(IsMember(x, string_set))
becomes
from z3 import *
s = Solver()
string_set = Function('string_set', StringSort(), BoolSort())
x = String('x')
s.add(string_set(x))
Are there any drawbacks to this approach? Alternative representations with even less overhead?
Those are really your only options, as long as you want to restrict yourself to the standard interface. In the past, I also had luck with representing sets (and in general relations) outside of the solver, keeping the processing completely outside. Here's what I mean:
from z3 import *
def FSet_Empty():
return lambda x: False
def FSet_Insert(val, s):
return lambda x: If(x == val, True, s(val))
def FSet_Delete(val, s):
return lambda x: If(x == val, False, s(val))
def FSet_Member(val, s):
return s(val)
x, y, z = Ints('x y z')
myset = FSet_Insert(x, FSet_Insert(y, FSet_Insert(z, FSet_Empty())))
s = Solver()
s.add(FSet_Member(2, myset))
print(s.check())
print(s.model())
Note how we model sets by unary relations, i.e., functions from values to booleans. You can generalize this to arbitrary relations and the ideas carry over. This prints:
sat
[x = 2, z = 4, y = 3]
You can easily add union (essentially Or), intersection (essentially And), and complement (essentially Not) operations. Doing cardinality is harder, especially in the presence of complement, but that's true for all the other approaches too.
As is usual with these sorts of modeling questions, there's no single approach that will work best across all problems. They'll all have their strengths and weaknesses. I'd recommend creating a single API, and implementing it using all three of these ideas, and benchmarking your problem domain to see what works the best; keeping in mind if you start working on a different problem the answer might be different. Please report your findings!

Hashtable in Ocaml where key is a tuple

I am trying to construct a LL1 parse table in ocaml. I'd like for the key to be a Nonterminal, input_symbol tuple. Is this possible?
I know you can do a stack of tuples:
let (k : (string*string) Stack.t) = Stack.create ();;
Thank you in advance!!
The key of a hash table in OCaml can have any type that can be compared for equality and can be hashed to an integer. The "vanilla" interface uses the built-in polymorphic comparison to compare for equality, and a built-in polymorphic hash function.
The built-in polymorphic comparison function fails for function types and for cyclic values.
There is also a functorial interface that lets you define your own equality and hash functions. So you can even have a hash table with keys that contain functions if you do a little extra work (assuming you don't expect to compare the functions for equality).
It is not difficult to make a hash table with tuples as keys:
# let my_table = Hashtbl.create 64;;
val my_table : ('_weak1, '_weak2) Hashtbl.t = <abstr>
# Hashtbl.add my_table (1, 2) "one two";;
- : unit = ()
# Hashtbl.add my_table (3, 4) "three four";;
- : unit = ()
# Hashtbl.find my_table (1, 2);;
- : string = "one two"

Dask Delayed ignores name for dependent variables

When creating a graph of calculations using delayed I'm trying to assign names so that if I visualize the graph it's readable. However, for delayed variables that are dependent on functions the name parameter doesn't seem to affect the key. Here's a toy example:
def calc_avg(a, b):
return pd.concat([a, b], axis=1).mean(axis=1)
def calc_ratio(a, b):
return a / b
a = delayed(pd.Series(np.random.rand(10)), name='a')
b = delayed(pd.Series(np.random.rand(10)), name='b')
c = delayed(pd.Series(np.random.rand(10)), name='c')
x = delayed(calc_avg, name='avg_result')(a,b)
y = delayed(calc_ratio, name='ratio_result')(x,c)
y.visualize()
You can see the visualization here (I can't embed images), but rather than seeing 'avg_result' I see 'calc_avg-#0' and rather than see 'ratio_result' I see 'calc_ratio-#1'. If I look at x.key or y.key they do not match the names that I provided. Is this the expected behavior?
The key of a dask result needs to be unique for every combination of the function that was delayed, and the inputs you give it. What you see above is the expected behaviour: you are naming the function, but a call with different inputs would expect a different output, so the key must be different.
You can specify the key you'd like associated not when you define the delayed function, but when you call it:
x = delayed(calc_avg)(a, b, dask_key_name='avg_result')
y = delayed(calc_ratio)(x, c, dask_key_name='ratio_result')

Can Z3 call python function during decision making of variables?

I am trying to solve a problem, for example I have a 4 point and each two point has a cost between them. Now I want to find a sequence of nodes which total cost would be less than a bound. I have written a code but it seems not working. The main problem is I have define a python function and trying to call it with in a constraint.
Here is my code: I have a function def getVal(n1,n2): where n1, n2 are Int Sort. The line Nodes = [ Int("n_%s" % (i)) for i in range(totalNodeNumber) ] defines 4 points as Int sort and when I am adding a constraint s.add(getVal(Nodes[0], Nodes[1]) + getVal(Nodes[1], Nodes[2]) < 100) then it calls getVal function immediately. But I want that, when Z3 will decide a value for Nodes[0], Nodes[1], Nodes[2], Nodes[3] then the function should be called for getting the cost between to points.
from z3 import *
import random
totalNodeNumber = 4
Nodes = [ Int("n_%s" % (i)) for i in range(totalNodeNumber) ]
def getVal(n1,n2):
# I need n1 and n2 values those assigned by Z3
cost = random.randint(1,20)
print cost
return IntVal(cost)
s = Solver()
#constraint: Each Nodes value should be distinct
nodes_index_distinct_constraint = Distinct(Nodes)
s.add(nodes_index_distinct_constraint)
#constraint: Each Nodes value should be between 0 and totalNodeNumber
def get_node_index_value_constraint(i):
return And(Nodes[i] >= 0, Nodes[i] < totalNodeNumber)
nodes_index_constraint = [ get_node_index_value_constraint(i) for i in range(totalNodeNumber)]
s.add(nodes_index_constraint)
#constraint: Problem with this constraint
# Here is the problem it's just called python getVal function twice without assiging Nodes[0],Nodes[1],Nodes[2] values
# But I want to implement that - Z3 will call python function during his decission making of variables
s.add(getVal(Nodes[0], Nodes[1]) + getVal(Nodes[1], Nodes[2]) + getVal(Nodes[2], Nodes[3]) < 100)
if s.check() == sat:
print "SAT"
print "Model: "
m = s.model()
nodeIndex = [ m.evaluate(Nodes[i]) for i in range(totalNodeNumber) ]
print nodeIndex
else:
print "UNSAT"
print "No solution found !!"
If this is not a right way to solve the problem then could you please tell me what would be other alternative way to solve it. Can I encode this kind of problem to find optimal sequence of way points using Z3 solver?
I don't understand what problem you need to solve. Definitely, the way getVal is formulated does not make sense. It does not use the arguments n1, n2. If you want to examine values produced by a model, then you do this after Z3 returns from a call to check().
I don't think you can use a python function in your SMT logic. What you could alternatively is define getVal as a Function like this
getVal = Function('getVal',IntSort(),IntSort(),IntSort())
And constraint the edge weights as
s.add(And(getVal(0,1)==1,getVal(1,2)==2,getVal(0,2)==3))
The first two input parameters of getVal represent the node ids and the last integer represents the weight.

Is there a difference between foreach and map?

Ok this is more of a computer science question, than a question based on a particular language, but is there a difference between a map operation and a foreach operation? Or are they simply different names for the same thing?
Different.
foreach iterates over a list and performs some operation with side effects to each list member (such as saving each one to the database for example)
map iterates over a list, transforms each member of that list, and returns another list of the same size with the transformed members (such as converting a list of strings to uppercase)
The important difference between them is that map accumulates all of the results into a collection, whereas foreach returns nothing. map is usually used when you want to transform a collection of elements with a function, whereas foreach simply executes an action for each element.
In short, foreach is for applying an operation on each element of a collection of elements, whereas map is for transforming one collection into another.
There are two significant differences between foreach and map.
foreach has no conceptual restrictions on the operation it applies, other than perhaps accept an element as argument. That is, the operation may do nothing, may have a side-effect, may return a value or may not return a value. All foreach cares about is to iterate over a collection of elements, and apply the operation on each element.
map, on the other hand, does have a restriction on the operation: it expects the operation to return an element, and probably also accept an element as argument. The map operation iterates over a collection of elements, applying the operation on each element, and finally storing the result of each invocation of the operation into another collection. In other words, the map transforms one collection into another.
foreach works with a single collection of elements. This is the input collection.
map works with two collections of elements: the input collection and the output collection.
It is not a mistake to relate the two algorithms: in fact, you may view the two hierarchically, where map is a specialization of foreach. That is, you could use foreach and have the operation transform its argument and insert it into another collection. So, the foreach algorithm is an abstraction, a generalization, of the map algorithm. In fact, because foreach has no restriction on its operation we can safely say that foreach is the simplest looping mechanism out there, and it can do anything a loop can do. map, as well as other more specialized algorithms, is there for expressiveness: if you wish to map (or transform) one collection into another, your intention is clearer if you use map than if you use foreach.
We can extend this discussion further, and consider the copy algorithm: a loop which clones a collection. This algorithm too is a specialization of the foreach algorithm. You could define an operation that, given an element, will insert that same element into another collection. If you use foreach with that operation you in effect performed the copy algorithm, albeit with reduced clarity, expressiveness or explicitness. Let's take it even further: We can say that map is a specialization of copy, itself a specialization of foreach. map may change any of the elements it iterates over. If map doesn't change any of the elements then it merely copied the elements, and using copy would express the intent more clearly.
The foreach algorithm itself may or may not have a return value, depending on the language. In C++, for example, foreach returns the operation it originally received. The idea is that the operation might have a state, and you may want that operation back to inspect how it evolved over the elements. map, too, may or may not return a value. In C++ transform (the equivalent for map here) happens to return an iterator to the end of the output container (collection). In Ruby, the return value of map is the output sequence (collection). So, the return value of the algorithms is really an implementation detail; their effect may or may not be what they return.
Array.protototype.map method & Array.protototype.forEach are both quite similar.
Run the following code: http://labs.codecademy.com/bw1/6#:workspace
var arr = [1, 2, 3, 4, 5];
arr.map(function(val, ind, arr){
console.log("arr[" + ind + "]: " + Math.pow(val,2));
});
console.log();
arr.forEach(function(val, ind, arr){
console.log("arr[" + ind + "]: " + Math.pow(val,2));
});
They give the exact ditto result.
arr[0]: 1
arr[1]: 4
arr[2]: 9
arr[3]: 16
arr[4]: 25
arr[0]: 1
arr[1]: 4
arr[2]: 9
arr[3]: 16
arr[4]: 25
But the twist comes when you run the following code:-
Here I've simply assigned the result of the return value from the map and forEach methods.
var arr = [1, 2, 3, 4, 5];
var ar1 = arr.map(function(val, ind, arr){
console.log("arr[" + ind + "]: " + Math.pow(val,2));
return val;
});
console.log();
console.log(ar1);
console.log();
var ar2 = arr.forEach(function(val, ind, arr){
console.log("arr[" + ind + "]: " + Math.pow(val,2));
return val;
});
console.log();
console.log(ar2);
console.log();
Now the result is something tricky!
arr[0]: 1
arr[1]: 4
arr[2]: 9
arr[3]: 16
arr[4]: 25
[ 1, 2, 3, 4, 5 ]
arr[0]: 1
arr[1]: 4
arr[2]: 9
arr[3]: 16
arr[4]: 25
undefined
Conclusion
Array.prototype.map returns an array but Array.prototype.forEach doesn't. So you can manipulate the returned array inside the callback function passed to the map method and then return it.
Array.prototype.forEach only walks through the given array so you can do your stuff while walking the array.
the most 'visible' difference is that map accumulates the result in a new collection, while foreach is done only for the execution itself.
but there are a couple of extra assumptions: since the 'purpose' of map is the new list of values, it doesn't really matters the order of execution. in fact, some execution environments generate parallel code, or even introduce some memoizing to avoid calling for repeated values, or lazyness, to avoid calling some at all.
foreach, on the other hand, is called specifically for the side effects; therefore the order is important, and usually can't be parallelised.
Short answer: map and forEach are different. Also, informally speaking, map is a strict superset of forEach.
Long answer: First, let's come up with one line descriptions of forEach and map:
forEach iterates over all elements, calling the supplied function on each.
map iterates over all elements, calling the supplied function on each, and produces a transformed array by remembering the result of each function call.
In many languages, forEach is often called just each. The following discussion uses JavaScript only for reference. It could really be any other language.
Now, let's use each of these functions.
Using forEach:
Task 1: Write a function printSquares, which accepts an array of numbers arr, and prints the square of each element in it.
Solution 1:
var printSquares = function (arr) {
arr.forEach(function (n) {
console.log(n * n);
});
};
Using map:
Task 2: Write a function selfDot, which accepts an array of numbers arr, and returns an array wherein each element is the square of the corresponding element in arr.
Aside: Here, in slang terms, we are trying to square the input array. Formally put, we are trying to compute it's dot product with itself.
Solution 2:
var selfDot = function (arr) {
return arr.map(function (n) {
return n * n;
});
};
How is map a superset of forEach?
You can use map to solve both tasks, Task 1 and Task 2. However, you cannot use forEach to solve the Task 2.
In Solution 1, if you simply replace forEach by map, the solution will still be valid. In Solution 2 however, replacing map by forEach will break your previously working solution.
Implementing forEach in terms of map:
Another way of realizing map's superiority is to implement forEach in terms of map. As we are good programmers, we'll won't indulge in namespace pollution. We'll call our forEach, just each.
Array.prototype.each = function (func) {
this.map(func);
};
Now, if you don't like the prototype nonsense, here you go:
var each = function (arr, func) {
arr.map(func); // Or map(arr, func);
};
So, umm.. Why's does forEach even exist?
The answer is efficiency. If you are not interested in transforming an array into another array, why should you compute the transformed array? Only to dump it? Of course not! If you don't want a transformation, you shouldn't do a transformation.
So while map can be used to solve Task 1, it probably shouldn't. For each is the right candidate for that.
Original answer:
While I largely agree with #madlep 's answer, I'd like to point out that map() is a strict super-set of forEach().
Yes, map() is usually used to create a new array. However, it may also be used to change the current array.
Here's an example:
var a = [0, 1, 2, 3, 4], b = null;
b = a.map(function (x) { a[x] = 'What!!'; return x*x; });
console.log(b); // logs [0, 1, 4, 9, 16]
console.log(a); // logs ["What!!", "What!!", "What!!", "What!!", "What!!"]
In the above example, a was conveniently set such that a[i] === i for i < a.length. Even so, it demonstrates the power of map().
Here's the official description of map(). Note that map() may even change the array on which it is called! Hail map().
Hope this helped.
Edited 10-Nov-2015: Added elaboration.
Here is an example in Scala using lists: map returns list, foreach returns nothing.
def map(f: Int ⇒ Int): List[Int]
def foreach(f: Int ⇒ Unit): Unit
So map returns the list resulting from applying the function f to each list element:
scala> val list = List(1, 2, 3)
list: List[Int] = List(1, 2, 3)
scala> list map (x => x * 2)
res0: List[Int] = List(2, 4, 6)
Foreach just applies f to each element:
scala> var sum = 0
sum: Int = 0
scala> list foreach (sum += _)
scala> sum
res2: Int = 6 // res1 is empty
If you're talking about Javascript in particular, the difference is that map is a loop function while forEach is an iterator.
Use map when you want to apply an operation to each member of the list and get the results back as a new list, without affecting the original list.
Use forEach when you want to do something on the basis of each element of the list. You might be adding things to the page, for example. Essentially, it's great for when you want "side effects".
Other differences: forEach returns nothing (since it is really a control flow function), and the passed-in function gets references to the index and the whole list, whereas map returns the new list and only passes in the current element.
ForEach tries to apply a function such as writing to db etc on each element of the RDD without returning anything back.
But the map() applies some function over the elements of rdd and returns the rdd. So when you run the below method it won't fail at line3 but while collecting the rdd after applying foreach it will fail and throw an error which says
File "<stdin>", line 5, in <module>
AttributeError: 'NoneType' object has no attribute 'collect'
nums = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
num2 = nums.map(lambda x: x+2)
print ("num2",num2.collect())
num3 = nums.foreach(lambda x : x*x)
print ("num3",num3.collect())

Resources