Hive bucket map join with different bucket sizes

In Hive, can I perform a bucket map join of two tables with different bucket counts (but bucketed on the same key)? Can someone please share their thoughts with an explanation.
For example, Table-A is bucketed by col-1 with 48 buckets, while Table-B is bucketed by col-1 with 64 buckets.
Note: neither table's bucket count is a multiple of the other's.
Thanks in advance..!!

According to the Hive documentation:
If the tables being joined are bucketized on the join columns, and the number of buckets in one table is a multiple of the number of buckets in the other table, the buckets can be joined with each other.
Explanation: Suppose table A and table B need to be joined. A has 2 buckets and B has 4 buckets.
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
For the query above, the mapper processing bucket 1 of A will fetch only the 2 corresponding buckets of B. But if the bucket counts are not exact multiples of each other, it is not possible to work out exactly which buckets need to be fetched.
So, in your case, it won't work unless the number of buckets in one table is a multiple of the number of buckets in the other.
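As a minimal sketch of what the answer describes (using the 2-bucket/4-bucket tables a and b from the quoted example; the column types are assumptions, and the setting is the bucket map join option mentioned elsewhere in this thread):
-- Hypothetical DDL mirroring the 2-bucket / 4-bucket example above.
CREATE TABLE a (key INT, value STRING)
CLUSTERED BY (key) INTO 2 BUCKETS;
CREATE TABLE b (key INT, value STRING)
CLUSTERED BY (key) INTO 4 BUCKETS;
-- Enable the bucket map join optimization before issuing the hinted query.
set hive.optimize.bucketmapjoin = true;
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key;
-- For the 48/64 case in the question, one of the tables would first have to be
-- rebucketed (e.g., to 96 or 48 buckets) so that one count is a multiple of the other.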

Related

Why do rows get "duplicated" records in Tableau when the tables are integrated?

I cannot find any helpful information about this in Tableau. For context, I have to keep the tables generic.
I use Snowflake as the database behind Tableau. With Tableau's data source, I am able to edit the connection for each endpoint's environment, such as sandbox, production, etc.
To check that the data is accurate, I ran some queries directly in Snowflake. I have made up the simple tables and query below. The USA table and the England table each have 300 rows (it's just an example).
SELECT name, gender, age
FROM USA a
LEFT JOIN England b ON (a.gender = b.gender)
WHERE gender = 'female';
The query above technically produces duplicated rows: it returns 200 rows. I need to avoid the duplicates, so I have to handle the join and add a GROUP BY.
SELECT name, gender, age, COUNT(*)
FROM USA a
LEFT JOIN England b ON (a.gender = b.gender)
WHERE gender = 'female'
GROUP BY 1,2,3;
As a result, the data is accurate and the query returns 100 rows in Snowflake.
I created the dashboard in Tableau after integrating the two connections from different environments, such as Production and Sandbox. I think this screenshot helps show the different colors in the connections section (its source is here). Since building the logical/physical tables in the data source, I have been seeing the inaccurate 200-row result. I also created a relationship and an index for the USA and England tables.
My goal in Tableau is to get 100 rows for the dashboard. How can I fix this to avoid the duplicated rows in Tableau's data source?
Information:
Version: Tableau 2021.4

Hive Sort Merge Bucket Join

Is a Sort Merge Bucket (SMB) join different from a Sort Merge Bucket Map (SMBM) join? If so, what hints should be added to enable an SMB join? How is an SMBM join superior to an SMB join?
Will "set hive.auto.convert.sortmerge.join=true" alone be sufficient for an SMB join, or should the settings below be included as well?
set hive.optimize.bucketmapjoin = true
set hive.optimize.bucketmapjoin.sortedmerge = true
The reason I ask is that the hint says bucket map join, but a map join is not performed here. I am under the assumption that both map and reduce tasks are involved in SMB, while only map tasks are involved in SMBM.
Please correct me if I am wrong.
If your table is large (as determined by "set hive.mapjoin.smalltable.filesize;"), you cannot do a map-side join. However, if your tables are bucketed and sorted, and you turn on "set hive.optimize.bucketmapjoin.sortedmerge = true", then you can still do a map-side join on large tables. (Of course, you still need "set hive.optimize.bucketmapjoin = true".)
Make sure that your tables are truly bucketed and sorted on the same column. It's easy to make mistakes. To get a bucketed and sorted table, you need to
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
DDL script
CREATE table XXX
(
id int,
name string
)
CLUSTERED BY (id)
SORTED BY (id)
INTO XXX BUCKETS
;
INSERT OVERWRITE TABLE XXX
select * from XXX
CLUSTER BY id
;
Use describe formatted XXX and look for Num Buckets, Bucket Columns, and Sort Columns to make sure it's correctly set up.
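For example, a quick check on the placeholder table above might look like:
describe formatted XXX;
-- In the output, verify that:
--   Num Buckets    matches the bucket count in your DDL
--   Bucket Columns contains the join column (id above)
--   Sort Columns   lists the same column in the expected order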
Other requirements for the bucket join are that the two tables should have:
Data bucketed on the same columns, and those columns are used in the ON clause.
The number of buckets for one table must be a multiple of the number of buckets for the other table.
If you meet all the requirements, then the map join will be performed, and it will be lightning fast.
By the way, SMB Map Join is not well supported in Hive 1.X for ORC format. You will get a null exception. The bug has been fixed in 2.X.
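Putting the answer's pieces together, a hedged sketch of the hinted join (the table names big_tbl and small_tbl are hypothetical; both are assumed to be bucketed and sorted on id, with bucket counts that are multiples of each other) could be:
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
-- The MAPJOIN hint names one of the tables (here the smaller one), as described above.
SELECT /*+ MAPJOIN(s) */ b.id, b.name, s.name
FROM big_tbl b JOIN small_tbl s ON b.id = s.id;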

Google Big Query: How to select subset of smaller table using nested fake join?

I would like to solve the problem related to How can I join two tables using intervals in Google Big Query? by selecting subset of of smaller table.
I wanted to use solution by #FelipeHoffa using row_number function Row number in BigQuery?
I have created nested query as follows:
SELECT a.DevID DeviceId,
a.device_make OS
FROM
(SELECT device_id DevID, device_make, A, lat, long, is_gps
FROM [Data.PlacesMaster]
 WHERE not device_id is null and is_gps is true) a
JOIN (select ROW_NUMBER() OVER() row_number,
             top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A, count
      from (SELECT top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A, COUNT(*) count
            from [Karol.fast_food_box]
            GROUP BY (....?)
            ORDER BY COUNT DESC,
      WHERE row_number BETWEEN 1000 AND 2000)) b ON a.A=b.A
WHERE (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat)
AND (a.long BETWEEN b.top_left_long AND b.bottom_right_long)
GROUP EACH BY DeviceId,
OS
Could you help me finalise it, please? I cannot break the smaller table with "group by"; I need to keep consistency between the two tables and select only the items whose lat,long from the MASTER.table fit into the given bounding box of the smaller table. I really need to match lat,long to a box; my solution from How can I join two tables using intervals in Google Big Query? works only for small tables (approx 1000 to 2000 rows), hence this issue. Thank you in advance.
It looks like you're applying two approaches at once: 1) split a table into chunks of rows, and run on each, and 2) include a field, "A", tagging your boxes and your points into 'regions', that you can equi-join on. Approach (1) just does the same total work in more pieces (also, it's adding complication), so I would suggest focusing on approach (2), which cuts the work down to be ~quadratic in each 'region' rather than quadratic in the size of the whole world.
So the key thing is what values your A takes on, and how many points and boxes carry each A value. For example, if A is a country code, that has the right logical structure, but it probably doesn't help enough once you get a lot of data in any one country. If it goes to the state or province code, that gets you one step farther. Quantized lat/long grid cells generalize better. Sooner or later you do have to deal with falling across a region edge, which can be somewhat tricky. I would use a lat/long grid myself.
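As a hedged sketch of that grid-cell idea in legacy BigQuery SQL (the 1-degree cell size is arbitrary, and the field names are the ones from the question), tagging each point with a quantized cell key could look like:
-- Tag each GPS point with a 1-degree lat/long grid cell key to use as "A".
-- Boxes spanning several cells would need one row per covered cell,
-- which is the region-edge issue mentioned above.
SELECT
  device_id,
  lat,
  long,
  CONCAT(STRING(INTEGER(FLOOR(lat))), ':', STRING(INTEGER(FLOOR(long)))) AS A
FROM [Data.PlacesMaster]
WHERE device_id IS NOT NULL AND is_gps IS TRUE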
What A values are you using? In your data, what is the A value with the maximum (number of points * number of boxes)?

Determining #Buckets for Hive's Bucketed Join

I have multiple tables: Table A, Table B, and Table C. I want to be able to use Hive's bucketed map join. I am aware that the bucket counts for A, B, and C should be multiples of each other.
Is there a general rule for estimating the required number of buckets when creating the tables so as to ensure a map-side join?
I haven't used bucket map join in production, so this is just some inference based on how a bucket map join works.
In a bucket map join, corresponding buckets from the two tables are joined together: the small table's bucket is used to build a hash table, and the large table's bucket file is iterated one record at a time in its original order, probing the in-memory hash table and generating the join results.
So I think each bucket of the small table should be small enough to fit in memory (the map slot's heap size you set in mapred-site.xml); the bigger the small table is, the more buckets you should set for it.
I think the big table's bucket count can be an arbitrary number, as long as it is a multiple of the small table's bucket count.
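A rough sizing sketch based on that reasoning (all numbers are made up for illustration): if the small table is about 8 GB and each map task gets roughly 1 GB of heap, you would want at least 8 buckets for it, and a multiple of that for the big table.
-- Hypothetical sizing: ~8 GB small table and ~1 GB map heap -> at least 8 buckets,
-- so each bucket can be loaded into a single mapper's memory.
CREATE TABLE small_tbl (id INT, name STRING)
CLUSTERED BY (id) INTO 8 BUCKETS;
-- The big table's bucket count only needs to be a multiple of the small table's.
CREATE TABLE big_tbl (id INT, payload STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;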

Input for a sort merge bucket map join in Hive

I would like to perform a sort-merge join as described in the Hive manual (Bucketed Map Join) using the following options
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
Both tables must be bucketed and sorted on the join column. My question is - does the sort have to be global, i.e. the keys in the first bucket are less than the keys in the second bucket, or is it sufficient that each bucket is sorted?
You must define the tables to be CLUSTERED BY the same column and SORTED BY the same column in the same order, INTO the same number of buckets.
Then, you must set the options you listed above AND write the hint /*+ MAPJOIN(x) */ where x is one of the tables.
Also, both tables must be joined as-is in the join clause; you can't use either of them in a sub-query before the join, because the data won't be bucketed and sorted after the sub-query, which happens first.
Finally, the join columns must be the ones the tables are bucketed/sorted on.
When you insert the data into the tables, you can either use the hive.enforce.sorting setting (set to true) or manually write the sort command.
Hive doesn't check that the buckets are actually sorted, so if they aren't, this might cause wrong results in the output.
Each mapper will read a bucket from the first table and the corresponding bucket from the second, and it will perform a sort-merge join.
To your question: no, they don't have to be globally sorted.
P.S.
You should issue the EXPLAIN command before running the query, and you'll see whether Hive plans to do a sort-merge bucket join or not.
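For example (the table names t1 and t2 are hypothetical), a quick pre-flight check could be:
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
-- Inspect the plan first; it should show a sort-merge bucket join operator rather
-- than a plain common join if all the conditions above are met.
EXPLAIN
SELECT /*+ MAPJOIN(t2) */ t1.key, t1.value
FROM t1 JOIN t2 ON t1.key = t2.key;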
