I have multiple tables, Table A, Table B, and Table C, and I want to use Hive's bucketed map join. I am aware that the bucket counts for A, B, and C should be multiples of each other.
Is there a general rule on estimating the required #buckets while creating the tables so as to ensure a map-side join?
I haven't used bucket map join in production, so this is just some inference based on how bucket map join works.
In a bucket map join, corresponding buckets from the two tables are joined together: the small table's bucket is used to build an in-memory hash table, and the large table's bucket file is scanned one record at a time in its original order, probing the hash table to generate the join results.
So each bucket of the small table should be small enough to fit in memory (the map slot's heap size you set in mapred-site.xml); the bigger the small table is, the more buckets you should give it.
The big table's bucket count can be fairly arbitrary, as long as it is a multiple of the small table's bucket count.
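As a rough illustration of that sizing rule, here is a minimal sketch in HiveQL. The table names, columns, bucket counts, and heap figures are all hypothetical; the MAPJOIN hint plus hive.optimize.bucketmapjoin are the usual way to request a bucket map join.

-- Hypothetical sizing: if small_tbl is roughly 2 GB and each map task has ~512 MB of heap,
-- 16 buckets keeps each of its buckets (~128 MB) comfortably under the heap limit.
CREATE TABLE small_tbl (id BIGINT, name STRING)
CLUSTERED BY (id) INTO 16 BUCKETS;

-- The big table's bucket count just needs to be a multiple of 16.
CREATE TABLE big_tbl (id BIGINT, payload STRING)
CLUSTERED BY (id) INTO 64 BUCKETS;

set hive.optimize.bucketmapjoin = true;

SELECT /*+ MAPJOIN(s) */ b.id, s.name, b.payload
FROM big_tbl b JOIN small_tbl s ON b.id = s.id;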
Related
Is Sort merge Bucket Join different from Sort Merge Bucket Map join? If so, what hints should be added to enable SMB join? How is SMBM join superior to SMB join?
Will "set hive.auto.convert.sortmerge.join=true" this hint alone be sufficient for SMB join? Else should the below hints be included as well.
set hive.optimize.bucketmapjoin = true
set hive.optimize.bucketmapjoin.sortedmerge = true
The reason I ask is that the hint says bucket map join, but a MAP join is not actually performed here. I am under the assumption that both map and reduce tasks are involved in SMB, while only map tasks are involved in SMBM.
Please correct me if I am wrong.
If your table is large (as determined by "set hive.mapjoin.smalltable.filesize;"), you cannot do an ordinary map-side join. But if your tables are bucketed and sorted, and you turn on "set hive.optimize.bucketmapjoin.sortedmerge = true", then you can still do a map-side join on large tables. (Of course, you still need "set hive.optimize.bucketmapjoin = true".)
Make sure that your tables are truly bucketed and sorted on the same column. It's so easy to make mistakes. To get a bucketed and sorted table, you need to
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
DDL script
CREATE table XXX
(
id int,
name string
)
CLUSTERED BY (id)
SORTED BY (id)
INTO XXX BUCKETS
;
INSERT OVERWRITE TABLE XXX
select * from YYY   -- YYY stands for your (unbucketed) source table
CLUSTER BY id       -- must match the CLUSTERED BY / SORTED BY column above
;
Use describe formatted XXX and look for Num Buckets, Bucket Columns, and Sort Columns to make sure it is correctly set up.
Other requirements for the bucket join are that the two tables should have
Data bucketed on the same columns, and those columns used in the ON clause.
The number of buckets for one table must be a multiple of the number of buckets for the other table.
If you meet all the requirements, then the MAP join will be performed. And it will be lightning fast.
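Putting the answer's requirements together, here is a hedged sketch of a session that requests a sort merge bucket map join. The table and column names are hypothetical, and the exact behaviour of these properties can differ between Hive versions.

set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join = true;
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Both tables must be bucketed and sorted on id, with bucket counts that are multiples of each other.
SELECT a.id, a.name, b.value
FROM big_a a JOIN big_b b ON a.id = b.id;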
By the way, SMB Map Join is not well supported in Hive 1.X for ORC format. You will get a null exception. The bug has been fixed in 2.X.
Imagine the following situation I am planning:
Have two rather large tables stored in Hive, both containing different types of customer related information (say, although this is not exactly the case, a record of customer transactions in one and customer owned data in the other). Let's call the tables A and B.
Tables are large in the sense that neither table fits completely in memory. (There are 10 million customers and there is a few kilobytes of info associated with each of them in each of the two tables.)
Be careful enough to bucket both tables in exactly the same way, by a field present in both tables (customer_id, which is a bigint), and using the same number of buckets, 100.
I wonder whether this setup will, in any way, guarantee that a join (by customer_id) between both tables will be efficient, in the sense that very little shuffling of information between nodes will be required. I imagine this could be the case if, for instance, there were a guarantee that the physical files corresponding to the same bucket in both tables are stored on the same (sets of) nodes, i.e. if for every bucket i (in [0,99]) the file A/part_0_000i and the file B/part_0_000i were physically stored on the same nodes, and the same held for their replicas.
Notes:
I am aware that partitioning and bucketing are different, and that the first essentially determines the structure of subdirectories, whereas the second determines which file each record goes to. This question is about bucketing only.
Also, by point 2 above, map-side joins are not an option here, since, as far as my understanding goes, they require loading one of the tables completely within each mapper and doing the join completely there.
Bucketing is used when the column you would want to partition by has too many distinct values, or when there are no good candidate partition columns.
A concrete example would be partitioning on customerID in sales data. You may have 20 thousand customers, so each partition would contain only a small amount of data, which is inefficient, and having that many partitions is itself inefficient. Instead, you can hash the customerID into, for example, 50 buckets. Then, when you are joining on customerID, the job only has to scan what is in one bucket rather than the entire sum of your data.
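A hedged sketch of that example in HiveQL; the table names, the extra columns, and the source table sales_raw are hypothetical.

-- Hash customerID into 50 buckets instead of creating ~20,000 tiny partitions.
CREATE TABLE sales_bucketed (
  customerID BIGINT,
  amount     DOUBLE,
  sale_date  STRING
)
CLUSTERED BY (customerID) INTO 50 BUCKETS;

set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE sales_bucketed
SELECT customerID, amount, sale_date FROM sales_raw;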
With ideal bucketing, each bucket should be some multiple of the file system block size. Remember also that too many buckets, or buckets built over variables not used as join keys, can be detrimental for other queries.
I have used buckets when I need to execute large jobs repeatedly. My query times have been reduced significantly. I tend to use them only when my data is very big, and big is relative to cluster size and capacity.
One great thing about bucketing is that it helps ensure the buckets are of similar size. If you partition over State, for example, California will have a huge partition while other states have very small ones.
Bucketing is tactical and not appropriate for all use cases. Happy bucketing!
Yes, it will definitely help.
Tables that are bucketed and sorted the same way can be merge-joined directly, which works in linear time, O(n); otherwise the tables have to be sorted the same way first, which is usually O(n log n).
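For the 100-bucket scenario described in the question, a minimal sketch of what the table definitions and session settings might look like; the non-key columns are hypothetical, and the settings are the ones commonly used to enable a sort merge bucket map join.

CREATE TABLE A (customer_id BIGINT, txn_info STRING)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 100 BUCKETS;

CREATE TABLE B (customer_id BIGINT, owned_info STRING)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 100 BUCKETS;

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join = true;

-- Each mapper merge-joins bucket i of A against bucket i of B; no shuffle of either table is needed.
SELECT a.customer_id, a.txn_info, b.owned_info
FROM A a JOIN B b ON a.customer_id = b.customer_id;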
In Hive, can I perform a bucket map join on two tables with different bucket counts (but bucketed on the same key)? Can someone please share their thoughts with an explanation?
For example Table-A is bucketed by col-1 with 48 buckets, while Table-B is bucketed by col-1 with 64 buckets.
Note: Table-A's bucket count is not a multiple of Table-B's bucket count, nor vice versa.
Thanks in advance..!!
According to the Hive documentation:
If the tables being joined are bucketized on the join columns, and the number of buckets in one table is a multiple of the number of buckets in the other table, the buckets can be joined with each other.
Explanation: Suppose table A and table B needs to be joined. A has 2 buckets and B has 4 buckets.
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
For the query above, the mapper processing bucket 1 of A only needs to fetch the corresponding 2 buckets of B. But if the bucket counts are not exact multiples of each other, a mapper cannot determine an exact set of buckets to fetch.
So, in your case, it won't work unless the number of buckets in one table is a multiple of the number of buckets in the other.
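If you do need a bucket map join between those two tables, one hedged workaround, sketched below, is to rebuild one side with a compatible bucket count (for example 96, a multiple of 48). The column name col_1 and the table names follow the question; everything else is hypothetical, and it assumes Table-A is the side small enough to hash.

-- Rebuild Table-B with 96 buckets (a multiple of Table-A's 48) on the same key.
CREATE TABLE table_b_rebucketed (col_1 BIGINT, other_col STRING)
CLUSTERED BY (col_1) INTO 96 BUCKETS;

set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE table_b_rebucketed
SELECT col_1, other_col FROM table_b;

set hive.optimize.bucketmapjoin = true;
SELECT /*+ MAPJOIN(a) */ a.col_1, b.other_col
FROM table_a a JOIN table_b_rebucketed b ON a.col_1 = b.col_1;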
I am trying to implement a hash join in Hadoop.
However, Hadoop already seems to have a map-side join and a reduce-side join implemented.
What is the difference between these techniques and hash join?
Map-side Join
In a map-side (fragment-replicate) join, you hold one dataset in memory (in, say, a hash table) and join against the other dataset, record by record. In Pig, you'd write
edges_from_list = JOIN a_follows_b BY user_a_id, some_list BY user_id using 'replicated';
taking care that the smaller dataset is on the right. This is extremely efficient, as there is no network overhead and minimal CPU demand.
Reduce Join
In a reduce-side join, you group on the join key using hadoop's standard merge sort.
<user_id {A, B, F, ..., Z}, { A, C, G, ..., Q} >
and emit a record for every pair of an element from the first set with an element from the second set:
[A user_id A]
[A user_id C]
...
[A user_id Q]
...
[Z user_id Q]
You should design your keys so that the dataset with the fewest records per key comes first -- you need to hold the first group in memory and stream the second one past it. In Pig, for a standard join you accomplish this by putting the largest dataset last. (As opposed to the fragment-replicate join, where the in-memory dataset is given last).
Note that for a map-side join the entirety of the smaller dataset must fit in memory. In a standard reduce-side join, only each key's groups must fit in memory (actually each key's group except the last one). It's possible to avoid even this restriction, but it requires care; look for example at the skewed join in Pig.
Merge Join
Finally, if both datasets are stored in total-sorted order on the join key, you can do a merge join on the map side. Same as the reduce-side join, you do a merge sort to cogroup on the join key, and then project (flatten) back out on the pairs.
Because of this, when generating a frequently-read dataset it's often a good idea to do a total sort in the last pass. Zebra and other databases may also give you total-sorted input for (almost) free.
Both of these Hadoop joins are merge joins, which require (explicit) sorting beforehand.
A hash join, on the other hand, does not require sorting, but instead partitions the data by some hash function.
Detailed discussion can be found in section "Relational Joins" in Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, a well-written book that is free and open source.
Assume that we have N Erlang nodes running the same application. I want to share an mnesia table T1 with all N nodes, which I see no problem with. However, I want to share another mnesia table T2 only between pairs of nodes; that is, the contents of T2 will be identical and replicated only within each sharing pair. In other words, I want N/2 different contents for the T2 table. Is this possible with mnesia, without renaming T2 for each distinct pair of nodes?
It's possible to do this with mnesia's table fragmentation, if one makes use of the mnesia_frag_hash callback behaviour. This allows you to control the distribution of keys, and it would be possible to construct the keys such that the callback is able to determine which node pair (and thus, which fragment) should be used.
Whether or not this works in your particular case depends on your access patterns and data set. Chances are that it's a pretty convoluted approach, and that you'd be better served by simply using different table names instead.
One table is always one table, no matter how many nodes you share it with. If you want pairs of nodes sharing a table, you would have to create a unique table for each pair of nodes.
You can use the same settings (record definitions, etc.) for all those tables, though, so it shouldn't be much more work to get it done.