Does Hadoop's map-side join implement a hash join?

I am trying to implement a hash join in Hadoop.
However, Hadoop already seems to have a map-side join and a reduce-side join implemented.
What is the difference between these techniques and a hash join?

Map-side Join
In a map-side (fragment-replicate) join, you hold one dataset in memory (in, say, a hash table) and join the other dataset against it, record by record. In Pig, you'd write
edges_from_list = JOIN a_follows_b BY user_a_id, some_list BY user_id using 'replicated';
taking care that the smaller dataset is on the right. This is extremely efficient, as there is no network overhead and minimal CPU demand.
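For concreteness, here is a minimal sketch of the same idea outside Pig, in plain Python (the record layout and names such as small_records / big_records are hypothetical): the small dataset is loaded into an in-memory hash table keyed on the join key, and the large dataset is streamed past it.

from collections import defaultdict

def map_side_hash_join(small_records, big_records, small_key, big_key):
    # Fragment-replicate (map-side) join: build a hash table on the small
    # dataset, then stream the big dataset through it record by record.
    table = defaultdict(list)
    for rec in small_records:                      # build phase (must fit in memory)
        table[rec[small_key]].append(rec)
    for rec in big_records:                        # probe phase, streamed
        for match in table.get(rec[big_key], []):
            yield {**rec, **match}                 # one output per matching pair

# Hypothetical usage with dict records:
# joined = list(map_side_hash_join(some_list, a_follows_b, 'user_id', 'user_a_id'))

No sorting or shuffling happens anywhere, which is why there is no network overhead.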
Reduce Join
In a reduce-side join, you group on the join key using Hadoop's standard merge sort.
<user_id {A, B, F, ..., Z}, { A, C, G, ..., Q} >
and emit a record for every pair of an element from the first set with an element from the second set:
[A user_id A]
[A user_id C]
...
[A user_id Q]
...
[Z user_id Q]
You should design your keys so that the dataset with the fewest records per key comes first -- you need to hold the first group in memory and stream the second one past it. In Pig, for a standard join you accomplish this by putting the largest dataset last. (As opposed to the fragment-replicate join, where the in-memory dataset is given last).
Note that for a map-side join the entirety of the smaller dataset must fit in memory. In a standard reduce-side join, only each key's group must fit in memory (in fact, only every group but the last one, since the final group can be streamed). It's possible to avoid even this restriction, but it requires care; look, for example, at the skewed join in Pig.
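As a sketch of what the reducer does for one join key (the 'L'/'R' tagging scheme below is an assumption, usually arranged via a secondary sort so the smaller side's records arrive first):

def reduce_side_join(key, tagged_values):
    # tagged_values yields ('L', record) or ('R', record) for this key,
    # with all 'L' records first. Buffer the first group, stream the second.
    left_buffer = []
    for tag, record in tagged_values:
        if tag == 'L':
            left_buffer.append(record)
        else:
            for left_record in left_buffer:
                yield (left_record, key, record)   # one output per cross pair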
Merge Join
Finally, if both datasets are stored in total-sorted order on the join key, you can do a merge join on the map side. Same as the reduce-side join, you do a merge sort to cogroup on the join key, and then project (flatten) back out on the pairs.
Because of this, when generating a frequently-read dataset it's often a good idea to do a total sort in the last pass. Zebra and other databases may also give you total-sorted input for (almost) free.
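A rough sketch of that map-side merge join over two already-sorted inputs (field layout hypothetical; the join key is assumed to be the first field of each record):

from itertools import groupby

def merge_join(sorted_a, sorted_b, key=lambda r: r[0]):
    # Both inputs must already be in total-sorted order on the join key.
    groups_a = groupby(sorted_a, key)
    groups_b = groupby(sorted_b, key)
    ka, ga = next(groups_a, (None, None))
    kb, gb = next(groups_b, (None, None))
    while ga is not None and gb is not None:
        if ka < kb:
            ka, ga = next(groups_a, (None, None))   # advance the side that is behind
        elif ka > kb:
            kb, gb = next(groups_b, (None, None))
        else:
            run_b = list(gb)                        # materialize one side's run of equal keys
            for ra in ga:                           # stream the other side's run
                for rb in run_b:
                    yield (ka, ra, rb)
            ka, ga = next(groups_a, (None, None))
            kb, gb = next(groups_b, (None, None))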

Both of these Hadoop joins are merge joins, which require an (explicit) sort beforehand.
A hash join, on the other hand, does not require sorting; instead it partitions the data by some hash function.
Detailed discussion can be found in section "Relational Joins" in Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, a well-written book that is free and open source.
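For contrast, here is a minimal sketch of a partitioned hash join (names illustrative): both inputs are partitioned by a hash of the join key, and within each partition a hash table is built on one side and probed with the other, with no sorting anywhere.

def partitioned_hash_join(left, right, key_left, key_right, num_partitions=8):
    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]
    for rec in left:                                          # partition phase
        left_parts[hash(rec[key_left]) % num_partitions].append(rec)
    for rec in right:
        right_parts[hash(rec[key_right]) % num_partitions].append(rec)
    for lpart, rpart in zip(left_parts, right_parts):         # per-partition hash join
        table = {}
        for rec in lpart:                                     # build on the (smaller) side
            table.setdefault(rec[key_left], []).append(rec)
        for rec in rpart:                                     # probe with the other side
            for match in table.get(rec[key_right], []):
                yield (match, rec)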

Related

Presto Multi table Join with Broadcast Join Distribution

I have 3 tables:
A
- id1
- data
B
- id1
- id2
- data
C
- id2
- data
Table A is very small, while tables B and C are potentially huge.
Table B has the joining keys for both tables A and C, so it has to be present in the first join.
From what I understand about joins in Presto, when cost-based optimizations are not enabled, the joins are executed in the order they are declared.
Also, we would obviously want to have the smaller table A in the first join operation, as that will reduce the data size.
So this means the first join will be between tables A and B.
But if I want to perform a distributed join, then the build side (right side) of the join should be the smaller table.
So when I come to the second join, between the result of AxB and C, the right side of the join inevitably ends up being the larger table.
I'm very curious how people generally handle such a scenario in Presto. If the build side for the distributed join had been the left side, then it would have followed naturally that we always order the smaller tables to the left.
The ideas of performing joins in the order they are defined and expecting the right-side table to be smaller for distributed joins seem contradictory.
Presto generally performs the join in the declared order (when cost-based optimizations are off), but it tries to avoid cross joins if possible. If you run EXPLAIN on your query, you should be able to see the actual join order for your query.
For the example above, you could avoid the cross joins manually by forcing a right-associative join with parentheses, similar to how arithmetic works (e.g., a - (b - c)):
WITH
a(x) AS (VALUES(1)),
b(x,y) AS (VALUES (1,'a')),
c(y) AS (VALUES 'a')
SELECT *
FROM c JOIN (b JOIN a USING (x)) USING (y)

Fuzzy join on multiple columns with Spark

I have two Spark RDDs without a common key that I need to join.
The first RDD comes from a Cassandra table and contains a reference set of items (id, item_name, item_type, item_size), for example: (1, 'item 1', 'type_a', 20).
The second RDD is imported each night from another system; it contains roughly the same data, without the id, in raw form (raw_item_name, raw_type, raw_item_size), for example ('item 1.', 'type a', 20).
Now I need to join those two RDDs based on the similarity of the data. Right now the size of the RDDs is about 10,000 records, but in the future it will grow.
My current solution is: a cartesian join of both RDDs, then calculating the distance between the reference and raw attributes for each row, then grouping by id and selecting the best match.
At this size the solution works, but I'm afraid that in the future the cartesian join might just be too big.
What would be a better solution?
I tried to look at Spark MLlib but didn't know where to start or which algorithm to use. Any advice will be greatly appreciated.
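For reference, a minimal PySpark sketch of the cartesian-plus-scoring approach described above (the similarity function and sample data are hypothetical placeholders); the cartesian product is the part that grows quadratically with the data size.

from pyspark import SparkContext

def similarity(ref, raw):
    # Hypothetical placeholder score; a real one would also compare names and types.
    return 1 if ref[3] == raw[2] else 0

sc = SparkContext(appName='fuzzy-join-sketch')
ref_rdd = sc.parallelize([(1, 'item 1', 'type_a', 20)])           # (id, item_name, item_type, item_size)
raw_rdd = sc.parallelize([('item 1.', 'type a', 20)])             # (raw_item_name, raw_type, raw_item_size)

best = (ref_rdd.cartesian(raw_rdd)                                 # every (ref, raw) pair
        .map(lambda p: (p[0][0], (similarity(p[0], p[1]), p[1])))  # key by reference id
        .reduceByKey(max))                                         # keep the best-scoring raw record
print(best.collect())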

How to design this spark join

I need to join two big RDDs, potentially twice. Any help designing these joins is appreciated.
Here is the problem,
First RDD is productIdA, productIdB, similarity and the size is about 100G.
Second RDD is customerId, productId, boughtPrice and size is about 35G.
The result RDD I want is productIdA, productIdB, similarity, and the customerIds who bought both product A and product B.
Because I cannot broadcast either RDD (both of them are quite big), my design is to aggregate the second RDD by product id and then join it with the first RDD twice, but I get huge shuffle spill and all kinds of errors (OOM, or running out of space because of the shuffle). Errors aside, I would like to know whether there is a better way to achieve the same result. Thanks
Do you have a row for every product pairing in the first RDD?
If you do (or it's close), then you might want to do something like: group the second RDD by customerId, create an element for every product pairing, regroup that RDD by pairing to collect the list of customerIds per pair, and then join to add in the similarity.
(Whether or not this will result in more or less math depends, I think, on the distribution of number of products purchased per customer.)
Like zero323's comment also implies, once you have the pairings from grouping on customerId, it might be cheaper to recalculate the similarity than to join on a huge dataset.
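A rough PySpark sketch of that plan, assuming tuple RDDs shaped as in the question (the sample data and names are hypothetical):

from itertools import combinations
from pyspark import SparkContext

sc = SparkContext(appName='pair-customers-sketch')

similarities = sc.parallelize([(('p1', 'p2'), 0.9)])                  # ((prodA, prodB), similarity)
purchases = sc.parallelize([('c1', 'p1', 10.0), ('c1', 'p2', 5.0)])   # (customerId, productId, boughtPrice)

def product_pairs(record):
    customer, products = record
    # One element per unordered pair of products this customer bought.
    for a, b in combinations(sorted(set(products)), 2):
        yield ((a, b), customer)

pairs = (purchases
         .map(lambda r: (r[0], r[1]))    # (customerId, productId)
         .groupByKey()                   # (customerId, <productIds>)
         .flatMap(product_pairs)         # ((prodA, prodB), customerId)
         .groupByKey())                  # ((prodA, prodB), <customerIds>)

# Join against the similarity RDD, which is keyed the same way.
result = similarities.join(pairs).mapValues(lambda v: (v[0], list(v[1])))
print(result.collect())                  # [(('p1', 'p2'), (0.9, ['c1']))]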

What is the default MapReduce join used by Apache Hive?

What is the default MapReduce join algorithm implemented by Hive? Is it a Map-Side Join, Reduce-Side, Broadcast-Join, etc.?
It is not specified in the original paper nor the Hive wiki on joins:
http://cs.brown.edu/courses/cs227/papers/hive.pdf
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
The 'default' join would be the shuffle join, a.k.a. the common join. See JoinOperator.java. It relies on the M/R shuffle to partition the data, and the join is done on the reduce side. Since the shuffle copies an amount of data proportional to the input size, it is slow.
A much better option is the MapJoin, see MapJoinOperator.java. This works if you have only one big table and one or more small tables to join against (e.g., a typical star schema). The small tables are scanned first, a hash table is built and uploaded into the HDFS cache, and then the M/R job is launched, which only needs to split one table (the big one). It is much more efficient than the shuffle join, but requires the small table(s) to fit in the memory of the M/R map tasks. Normally Hive (at least since 0.11) will try to use a MapJoin, but it depends on your configs.
A specialized join is the bucket-sort-merge join, a.k.a. the SMBJoin, see SMBJoinOperator.java. This works if you have two big tables whose bucketing matches on the join key. The M/R job splits can then be arranged so that a map task gets only the splits from the two big tables that are guaranteed to overlap on the join key, so the map task can use a hash table to do the join.
There are more details, like skew join support and fallback in out-of-memory conditions, but this should probably get you started into investigating your needs.
A very good presentation on the subject of joins is Join Strategies in Hive. Keep in mind that things evolve fast and a presentation from 2011 is a bit outdated.
Do an explain on the Hive query and you can see the execution plan.

How to use cogroup for join

I have this join.
A = JOIN smallTableBigEnoughForInMemory BY (F1,F2) RIGHT OUTER, massive BY (F1,F2);
B = JOIN anotherSmallTableBigforInMemory BY (F1,F3) RIGHT OUTER, massive BY (F1,F3);
Since both joins are using one common key, I was wondering if COGROUP can be used for joining data efficiently. Please note this is a RIGHT outer join.
I did think about cogrouping on F1, but the small tables have multiple combinations (200-300) per key, so I have not joined on the single key.
I think partitioning may help, but the data has skew and I am not sure how to use it in Pig.
You are looking for Pig's implementation of fragment-replicate joins. See the O'Reilly book Programming Pig for more details about the different join implementations. (See in particular Chapter 8, Making Pig Fly.)
In a fragment-replicate join, no reduce phase is required because each record of the large input is streamed through the mapper, matched up with any records of the small input (which is entirely in memory), and output. However, you must be careful not to do this kind of join with an input which will not fit into memory -- Pig will issue an error and the job will fail.
In Pig's implementation, the large input must be given first, so you will actually be doing a left outer join. Just tack on "using 'replicated'":
A = JOIN massive BY (F1,F2) LEFT OUTER, smallTableBigEnoughForInMemory BY (F1,F2) USING 'replicated';
The join for B will be similar.
