What is the default MapReduce join used by Apache Hive? - join

What is the default MapReduce join algorithm implemented by Hive? Is it a Map-Side Join, Reduce-Side, Broadcast-Join, etc.?
It is not specified in the original paper nor the Hive wiki on joins:
http://cs.brown.edu/courses/cs227/papers/hive.pdf
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

The 'default' join would be the shuffle join, aka. as common-join. See JoinOperator.java. It relies on M/R shuffle to partition the data and the join is done on the Reduce side. As is a size-of-data copy during the shuffle, it is slow.
A much better option is the MapJoin, see MapJoinOpertator.java. This works if you have only one big table and one or more small tables to join against (eg. typical star schema). The small tables are scanned first, a hash table is built and uploaded into the HDFS cache and then the M/R job is launched which only needs to split one table (the big table). Is much more efficient than shuffle join, but requires the small table(s) to fit in memory of the M/R map tasks. Normally Hive (at least since 0.11) will try to use MapJoin, but it depends on your configs.
A specialized join is the bucket-sort-merge join, aka. SMBJoin, see SMBJoinOperator.java. This works if you have 2 big tables that match the bucketing on the join key. The M/R job splits then can be arranged so that a map task gest only splits form the two big tables that are guaranteed to over overlap on the join key so the map task can use a hash table to do the join.
There are more details, like skew join support and fallback in out-of-memory conditions, but this should probably get you started into investigating your needs.
A very good presentation on the subject of joins is Join Strategies in Hive. Keep in mind that things evolve fast an a presentaiton from 2011 is a bit outdated.

Do an explain on the Hive query and you can see the execution plan.

Related

bucketing in non equi join in hive

Currently hive does support non equi join.
But as the cross product becomes pretty huge, I was wondering what are the options to tackle a large fact(257 billion rows, 37 tb) and relatively smaller(8.7 gb) dimension table join.
In case of equi join I can make it work easily with proper bucketing on the join column/columns . (using same number of buckets for SMBM practically converting to a map join). But if we think this wont be of any advantage when its a non equi join, because the values will be there in other buckets, practically triggering a shuffle i.e. a reduce phase.
If any one has any thoughts to overcome this, please suggest .....
If the dimension table fits in memory, you can create a Custom User Defined Function (UDF) as stated here, and perform the inequi-join in memory.

Internal working of Pentaho Mondrian for query generation

I am trying to analyse how sql queries are generated by Pentaho mondrian. Let us assume there are no aggregate tables as of now. I have noticed two types of behaviour when I try to fetch data from data warehouse (star schema) using Pentaho.
Case 1: I apply various filters and try to get fact count corresponding to it which is the default measure in my case.
Case 2: I apply the same filters as mentioned in case 1 and try to get some other measure by explicitly putting it into the measures selection box.
Observation: In both the cases, sql queries generated in the back-end include joins of fact table with multiple dimension tables as per the filters applied and columns and rows selected in Pentaho.
However, the join order is different in both the cases. In case 1, the fact table is placed at the left-most position of join whereas it is placed somewhere between the dimension tables in case 2.
I have connected Pentaho with AWS Athena at the back-end to execute queries on data stored on s3 with the help of jdbc connection. Since Athena has Presto at the back-end and Presto does not do automatic JOIN re-ordering, queries in case 2 are getting failed.
(http://docs.qubole.com/en/latest/user-guide/presto/best-practices.html)
I noticed that hash joins are being performed by Presto here. For hash joins to be effective, the largest table should be placed on the left side of join so that the smaller table is cached in memory while performing join. This is not happening in second case and it is trying to hash the fact table which consists of a large amount of data as compared to any of the dimension tables. This causes the query to fail whenever I add measure explicitly (other than default measure) and the data range is large (across an year for example).
Can someone please give an insight into the logic behind query formation of Mondrian in both the cases. Also, is there a way we can make the fact table to always remain on the left-most position of joins in the sql queries generated by Mondrian. Or is there any property of Presto which could be set through Athena to change the join type from hash join to some other type of join in which could solve this problem.
Pentaho version - 6.1.0
Saiku version - 3.10

In Lucene/Solr what is the difference between Join and BlockJoin?

Join is described as pseudo-Join, because it's more equivalent to an SQL inner-query.
Whereas BlockJoin is described as more like a SQL join but requiring a sophisticated indexing schema, one that anticipates all the possible joins you'd want to make.
Could someone explain the difference between these features in terms of how to implement them at index time and query time. And what are the implications for performance?
I don't think blockjoinquery is a Solr function. I think its Lucene feature.
The solr join doesn't score documents in the from query and it doesn't return combined results. So its best used as a filter query. This will allow the main query.to score.
Block join on the other hand does use scoring and returns both results.( not 100% sure)
You can also use querytime join. This has serval scoring options. This is also a lucene feature but doesn't require special indexing blocks. I've used this in combination with a solr query parser plugin. The performance is a bit lower then blockjoin but it Works.
I have only used solr join and querytimejoin So I can't really say much about blockjoin.
As I understand, BlockJoin is for joining against nested/child documents within the same core. Join is for joining against a separate core.

How to use cogroup for join

I have this join.
A = Join smallTableBigEnoughForInMemory on (F1,F2) RIGHT OUTER, massive on (F1,F2);
B = Join anotherSmallTableBigforInMemory on (F1,F3 ) RIGHT OUTER, massive on (F1,F3);
Since both joins are using one common key, I was wondering if COGROUP can be used for joining data efficiently. Please note this is a RIGHT outer join.
I did think about cogrouping on F1, but small tables has multiple combination ( 200-300) on single key so I have not used join using single key.
I think partitioning may help but data has skew and I am not sure how to use it in Pig
You are looking for Pig's implementation of fragment-replicate joins. See the O'Reilly book Programming Pig for more details about the different join implementations. (See in particular Chapter 8, Making Pig Fly.)
In a fragment-replicate join, no reduce phase is required because each record of the large input is streamed through the mapper, matched up with any records of the small input (which is entirely in memory), and output. However, you must be careful not to do this kind of join with an input which will not fit into memory -- Pig will issue an error and the job will fail.
In Pig's implementation, the large input must be given first, so you will actually be doing a left outer join. Just tack on "using 'replicated'":
A = JOIN massive BY (F1,F2) LEFT OUTER, smallTableBigEnoughForInMemory BY (F1,F2) USING 'replicated';
The join for B will be similar.

SOLR : joining from separate indices

Is there a good way to join data from SOLR indices ? Of course, I assume server side support for this is limited, but i want to do it client side (right now, im manually doing this in java, using hashmaps and loops .... Im assuming there might be a better way to combine data from different indices).
If with join you mean relational joining a la SQL, then no.
If with join you mean merging then server side support is far from limited.
What you are looking for is index sharding.
This is not "fast" since searches are distributed and then merged, but it scales really well.
Give a read to the following articles:
http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
http://wiki.apache.org/solr/DistributedSearch

Resources