Presto Multi-table Join with Broadcast Join Distribution

I have 3 tables:
A
- id1
- data
B
- id1
- id2
- data
C
- id2
- data
Table A is very small, while table B and C are potentially huge.
Table B has the joining keys for both tables A and C, so it has to be present in the first join.
From what I understand about joins in Presto, when cost-based optimizations are not enabled, the joins are executed in the order they are declared.
Also, we would obviously want the smaller table A in the first join operation, as that will reduce the data size.
So this means the first join will be between tables A and B.
But if I want to perform a distributed join, then the build side (the right side) of the join should be the smaller table.
So when I come to the second join, between the result of AxB and C, the right side of the join inevitably ends up being the larger table.
I am very curious how people generally handle such a scenario in Presto. If the build side for the distributed join had been the left side, it would have followed naturally that we always order the smaller tables to the left.
The ideas of performing joins in the order they are defined and expecting the right-side table to be smaller for distributed joins seem contradictory.

Presto generally performs joins in the declared order (when cost-based optimizations are off), but it tries to avoid cross joins if possible. If you run EXPLAIN on your query, you should be able to see the actual join order.
For the example above, you could avoid the cross join manually by forcing a right-associative join with parentheses, similar to how arithmetic works (e.g., a - (b - c)):
WITH
a(x) AS (VALUES(1)),
b(x,y) AS (VALUES (1,'a')),
c(y) AS (VALUES 'a')
SELECT *
FROM c JOIN (b JOIN a USING (x)) USING (y)
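To verify what the engine actually chose, prefix the query with EXPLAIN and look at the join order and distribution in the plan. If your Presto version has the cost-based optimizer, you can also let it pick the join order and build side via session properties; the property names below are an assumption based on recent Presto releases, so check SHOW SESSION for your version:
-- assumption: session property names as in recent Presto releases (see SHOW SESSION)
SET SESSION join_reordering_strategy = 'AUTOMATIC';   -- let the CBO reorder joins
SET SESSION join_distribution_type = 'AUTOMATIC';     -- let the CBO choose broadcast vs. partitioned
EXPLAIN
WITH
a(x) AS (VALUES(1)),
b(x,y) AS (VALUES (1,'a')),
c(y) AS (VALUES 'a')
SELECT *
FROM c JOIN (b JOIN a USING (x)) USING (y)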

Related

Predicate Pushdown vs On Clause

When performing a join in Hive and then filtering the output with a where clause, the Hive compiler will try to filter data before the tables are joined. This is known as predicate pushdown (http://allabouthadoop.net/what-is-predicate-pushdown-in-hive/)
For example:
SELECT * FROM a JOIN b ON a.some_id=b.some_other_id WHERE a.some_name=6
Rows from table a which have some_name = 6 will be filtered before the join is performed, if predicate pushdown is enabled (hive.optimize.ppd).
However, I have also learned recently that there is another way of filtering data from a table before joining it with another table (https://vinaynotes.wordpress.com/2015/10/01/hive-tips-joins-occur-before-where-clause/).
One can provide the condition in the ON clause, and table a will be filtered before the join is performed.
For example:
SELECT * FROM a JOIN b ON a.some_id=b.some_other_id AND a.some_name=6
Do both of these provide the predicate pushdown optimization?
Thank you
Both are valid, and in the case of an INNER JOIN with PPD both will work the same. But these methods work differently in the case of OUTER JOINs.
The ON join condition is applied before the join.
The WHERE clause is applied after the join.
The optimizer decides whether predicate pushdown is applicable, and it may kick in, but in the case of a LEFT JOIN with a WHERE filter on the right table, for example, the filter
SELECT * FROM a
LEFT JOIN b ON a.some_id=b.some_other_id
WHERE b.some_name=6 --Right table filter
will filter out NULLs, and the LEFT JOIN will be transformed into an INNER JOIN, because a row satisfying b.some_name=6 cannot be NULL.
PPD does not change this behavior.
You can still do a LEFT JOIN with a WHERE filter if you add an additional OR condition allowing NULLs in the right table:
SELECT * FROM a
LEFT JOIN b ON a.some_id=b.some_other_id
WHERE b.some_name=6 OR b.some_other_id IS NULL --allow not joined records
And if you have multiple joins with many such filtering conditions, logic like this makes your query difficult to understand and error-prone.
A LEFT JOIN with the filter in the ON clause does not require an additional OR condition, because it filters the right table before the join; this query works as expected and is easy to understand:
SELECT * FROM a
LEFT JOIN b ON a.some_id=b.some_other_id and b.some_name=6
PPD still works for the ON filter, and if table b is stored as ORC, PPD will push the predicate down to the lowest possible level, the ORC reader, and use the built-in ORC indexes to filter at three levels: rows, stripes, and files.
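If you want to check that pushdown is actually switched on, the relevant settings are roughly the following (a sketch; these are the usual Hive property names, but defaults and availability vary by version):
-- assumption: standard Hive property names; verify defaults for your version
SET hive.optimize.ppd=true;            -- predicate pushdown in the planner
SET hive.optimize.ppd.storage=true;    -- push predicates down to the storage handler
SET hive.optimize.index.filter=true;   -- push predicates into the ORC reader / indexes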
More on the same topic and some tests: https://stackoverflow.com/a/46843832/2700344
So, PPD or not, it is better to use explicit ANSI syntax with the ON condition and ON-level filtering where possible, to keep the query as simple as possible and to avoid unintentionally converting an outer join into an INNER JOIN.

How are the mappers decided while running a Hive Map Join?

This is stated on the wiki page of Apache Hive:
If all but one of the tables being joined are small, the join can be performed as a map-only job. The query
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
does not need a reducer. For every mapper of A, B is read completely.
How are the number of mappers decided if one of the tables being joined is small but the other is large enough to go out of a single mapper's resources?
Will the join automatically turn into a non-map join then?
The other table cannot be "too large": it is streamed through the mapper(s) rather than held in memory, so only the small table has to fit in memory.
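The number of mappers is then driven by the input splits of the large, streamed table, just as for any map-only job. In practice you usually let Hive decide whether a map join is safe instead of writing the MAPJOIN hint yourself; a sketch, assuming the usual property names (the threshold shown is only an example, not a recommendation):
-- assumption: standard Hive property names; threshold value is only an example
SET hive.auto.convert.join=true;                 -- convert to a map join when one side is small enough
SET hive.mapjoin.smalltable.filesize=25000000;   -- max size (bytes) of the table held in memory
SELECT a.key, a.value
FROM a JOIN b ON a.key = b.key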

How are block nested loop joins optimised for disk reads?

Quick question on reasonable optimisations on BNLJ:
Say we have following relations:
A(a,b,d)
B(c,m,n,ww,d)
And following query:
Select A.a, A.b, B.c
From A join B using (d)
Where A.b > 800 AND B.m = 'zack'
In this case, all possible optimisations on the relations are the following:
π[A.a, A.b, A.d]( σ[A.b > 800](A) )
π[B.c, B.d]( σ[B.m = 'zack'](B) )
where we first have selection, then projection.
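Written as an explicit, logically equivalent rewrite of the query (a sketch of what the optimiser derives, not something you would write by hand):
-- push selection and projection into both inputs before the join
SELECT fa.a, fa.b, fb.c
FROM (SELECT a, b, d FROM A WHERE b > 800) fa
JOIN (SELECT c, d FROM B WHERE m = 'zack') fb
ON fa.d = fb.d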
And now we have two ways to go about this: A as the outer loop and B as the inner, or A as the inner loop and B as the outer.
Ok, I know all the stuff about buffer pages and the formulas.
However, I am getting confused on this thing.
Do we first apply the optimisations to BOTH relations (regardless of which loop they end up in), store the 'optimised' pages in memory, and then fetch them to perform the join (so that in both cases, A-B or B-A, we use the optimised pages for BOTH)?
Or do we apply the optimisations only to the relation we choose as the outer loop, and not apply ANY optimisations to the relation we choose as the inner loop?
Or do we apply the optimisations to both as we go along?
And if the optimised version of the outer-loop relation is bigger than the buffer space, can we still store the 'filtered' records in memory and rely on the BNLJ to pick them up on consecutive passes of the outer loop?
So, in general, what is the whole process of making sure that the BNLJ is executed as efficiently as possible for the query above?
Thanks!

How to use cogroup for join

I have this join.
A = Join smallTableBigEnoughForInMemory on (F1,F2) RIGHT OUTER, massive on (F1,F2);
B = Join anotherSmallTableBigforInMemory on (F1,F3 ) RIGHT OUTER, massive on (F1,F3);
Since both joins are using one common key, I was wondering if COGROUP can be used for joining data efficiently. Please note this is a RIGHT outer join.
I did think about cogrouping on F1, but the small tables have multiple combinations (200-300) per key, so I have not joined on the single key.
I think partitioning may help, but the data has skew and I am not sure how to use it in Pig.
You are looking for Pig's implementation of fragment-replicate joins. See the O'Reilly book Programming Pig for more details about the different join implementations. (See in particular Chapter 8, Making Pig Fly.)
In a fragment-replicate join, no reduce phase is required because each record of the large input is streamed through the mapper, matched up with any records of the small input (which is entirely in memory), and output. However, you must be careful not to do this kind of join with an input which will not fit into memory -- Pig will issue an error and the job will fail.
In Pig's implementation, the large input must be given first, so you will actually be doing a left outer join. Just tack on "using 'replicated'":
A = JOIN massive BY (F1,F2) LEFT OUTER, smallTableBigEnoughForInMemory BY (F1,F2) USING 'replicated';
The join for B will be similar.
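Spelled out with the relation names from the question (a sketch along the same lines):
B = JOIN massive BY (F1,F3) LEFT OUTER, anotherSmallTableBigforInMemory BY (F1,F3) USING 'replicated';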

Hadoop's Map-side join implements Hash join?

I am trying to implement a hash join in Hadoop.
However, Hadoop seems to already have a map-side join and a reduce-side join implemented.
What is the difference between these techniques and hash join?
Map-side Join
In a map-side (fragment-replicate) join, you hold one dataset in memory (in say a hash table) and join on the other dataset, record-by-record. In Pig, you'd write
edges_from_list = JOIN a_follows_b BY user_a_id, some_list BY user_id using 'replicated';
taking care that the smaller dataset is on the right. This is extremely efficient, as there is no network overhead and minimal CPU demand.
Reduce Join
In a reduce-side join, you group on the join key using Hadoop's standard merge sort.
<user_id {A, B, F, ..., Z}, { A, C, G, ..., Q} >
and emit a record for every pair of an element from the first set with an element from the second set:
[A user_id A]
[A user_id C]
...
[A user_id Q]
...
[Z user_id Q]
You should design your keys so that the dataset with the fewest records per key comes first -- you need to hold the first group in memory and stream the second one past it. In Pig, for a standard join you accomplish this by putting the largest dataset last. (As opposed to the fragment-replicate join, where the in-memory dataset is given last).
Note that for a map-side join the entirety of the smaller dataset must fit in memory. In a standard reduce-side join, only each key's groups must fit in memory (actually each key's group except the last one). It's possible to avoid even this restriction, but it requires care; look for example at the skewed join in Pig.
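For reference, Pig's skewed join is just another USING clause; a sketch with placeholder relation and field names:
-- 'skewed' spreads records for hot keys across multiple reducers
joined = JOIN big_table BY key, other_table BY key USING 'skewed';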
Merge Join
Finally, if both datasets are stored in total-sorted order on the join key, you can do a merge join on the map side. Same as the reduce-side join, you do a merge sort to cogroup on the join key, and then project (flatten) back out on the pairs.
Because of this, when generating a frequently-read dataset it's often a good idea to do a total sort in the last pass. Zebra and other databases may also give you total-sorted input for (almost) free.
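Pig exposes this as a merge join; assuming both relations are already stored sorted on the join key (placeholder names again):
-- requires both inputs pre-sorted on the join key
C = JOIN sorted_a BY key, sorted_b BY key USING 'merge';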
Both of these Hadoop joins are merge joins, which require (explicit) sorting beforehand.
A hash join, on the other hand, does not require sorting, but instead partitions the data by some hash function.
Detailed discussion can be found in section "Relational Joins" in Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, a well-written book that is free and open source.
