Hive Sort Merge Bucket Join

Is Sort Merge Bucket (SMB) join different from Sort Merge Bucket Map (SMBM) join? If so, what hints should be added to enable an SMB join? How is an SMBM join superior to an SMB join?
Is the setting "set hive.auto.convert.sortmerge.join=true" alone sufficient for an SMB join, or should the settings below be included as well?
set hive.optimize.bucketmapjoin = true
set hive.optimize.bucketmapjoin.sortedmerge = true
The reason I ask is that the setting says "bucket map join", yet no map join is performed here. I am under the assumption that both map and reduce tasks are involved in SMB, while only map tasks are involved in SMBM.
Please correct me if I am wrong.

If your table is large (as determined by "set hive.mapjoin.smalltable.filesize;"), you cannot do a plain map-side join. But if your tables are bucketed and sorted, and you turn on "set hive.optimize.bucketmapjoin.sortedmerge = true", then you can still do a map-side join on large tables. (Of course, you still need "set hive.optimize.bucketmapjoin = true".)
Make sure that your tables are truly bucketed and sorted on the same column; it is easy to make mistakes. To get a bucketed and sorted table, you need to
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
DDL script
CREATE table XXX
(
id int,
name string
)
CLUSTERED BY (id)
SORTED BY (id)
INTO XXX BUCKETS
;
INSERT OVERWRITE TABLE XXX
select * from XXX
CLUSTER BY id -- must match the bucketing column declared in the DDL
;
Use "describe formatted XXX" and check Num Buckets, Bucket Columns, and Sort Columns to make sure the table is correctly set up.
Other requirements for the bucket join are that the two tables must have
data bucketed on the same columns, and those columns must be used in the ON clause.
The number of buckets for one table must be a multiple of the number of buckets for the other table.
If you meet all the requirements, the map join will be performed, and it will be lightning fast.
By the way, SMB map join is not well supported in Hive 1.x for the ORC format; you will get a null pointer exception. The bug has been fixed in 2.x.
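Putting it together, a minimal sketch of an SMB map join (table names, columns, and bucket counts are hypothetical; the settings are the ones discussed above):

set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;

-- Two tables bucketed and sorted on the join key; 8 is a multiple of 4.
CREATE TABLE orders (id int, amount double)
CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS;

CREATE TABLE users (id int, name string)
CLUSTERED BY (id) SORTED BY (id) INTO 4 BUCKETS;

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- The join columns are exactly the bucketed/sorted columns.
SELECT /*+ MAPJOIN(users) */ o.id, u.name, o.amount
FROM orders o JOIN users u ON o.id = u.id;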

Related

Internal working of Pentaho Mondrian for query generation

I am trying to analyse how SQL queries are generated by Pentaho Mondrian. Let us assume there are no aggregate tables for now. I have noticed two types of behaviour when I try to fetch data from a data warehouse (star schema) using Pentaho.
Case 1: I apply various filters and try to get the fact count corresponding to them, which is the default measure in my case.
Case 2: I apply the same filters as mentioned in case 1 and try to get some other measure by explicitly putting it into the measures selection box.
Observation: in both cases, the SQL queries generated in the back-end include joins of the fact table with multiple dimension tables, according to the filters applied and the columns and rows selected in Pentaho.
However, the join order differs between the two cases. In case 1, the fact table is placed at the left-most position of the join, whereas in case 2 it is placed somewhere between the dimension tables.
I have connected Pentaho to AWS Athena at the back-end to execute queries on data stored on S3 via a JDBC connection. Since Athena uses Presto under the hood and Presto does not do automatic JOIN re-ordering, the queries in case 2 are failing.
(http://docs.qubole.com/en/latest/user-guide/presto/best-practices.html)
I noticed that Presto performs hash joins here. For a hash join to be effective, the largest table should be placed on the left side of the join so that the smaller table is held in memory while performing the join. This is not happening in the second case: Presto tries to hash the fact table, which holds far more data than any of the dimension tables. This causes the query to fail whenever I add a measure explicitly (other than the default measure) and the date range is large (across a year, for example).
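For illustration, the join order that works for Presto's hash join looks roughly like this (the star-schema table and column names are hypothetical):

-- Largest table (the fact table) on the left; the smaller dimension
-- tables on the right are built into in-memory hash tables.
SELECT d.year, s.region, SUM(f.amount)
FROM fact_sales f
JOIN dim_date d ON f.date_id = d.date_id
JOIN dim_store s ON f.store_id = s.store_id
WHERE d.year = 2017
GROUP BY d.year, s.region;

The failing case-2 queries instead place the fact table on the right side of a join, forcing Presto to build a hash table over the fact data.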
Can someone please give some insight into the logic behind Mondrian's query formation in both cases? Also, is there a way to make the fact table always remain in the left-most position of the joins in the SQL queries generated by Mondrian? Or is there any property of Presto, settable through Athena, that changes the join type from a hash join to some other join type that could solve this problem?
Pentaho version - 6.1.0
Saiku version - 3.10

Determining #Buckets for Hive's Bucketed Join

I have multiple tables Table A, Table B, and Table C. I want to be able to use Hive's bucketed map join. I am aware that buckets for A, B, C should be multiples of each other.
Is there a general rule on estimating the required #buckets while creating the tables so as to ensure a map-side join?
I haven't used bucket map join in production, so this is just some inference based on the bucket map join's principle.
In a bucket map join, corresponding buckets from both tables are joined together: the small table's bucket is used to build a hash table, and the large table's bucket file is iterated one record at a time in its original order, probing the hash table in memory and generating join results.
So each bucket of the small table should be small enough to fit in memory (the map slot's heap size you set in mapred-site.xml). The bigger the small table is, the more buckets you should create for it.
The big table's bucket count can be fairly arbitrary; it just has to be a multiple of the small table's bucket count. A rough sizing example follows below.
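As a back-of-the-envelope sketch (the sizes and names here are hypothetical, not a rule from the Hive docs): if the small table is about 4 GB and each map task has roughly 1 GB of heap available, you would want at least 8 buckets so that each ~0.5 GB bucket fits comfortably in memory, and the big table could then use 8, 16, or 32 buckets.

-- Hypothetical sizing: small table ~4 GB -> 8 buckets (~0.5 GB each)
CREATE TABLE dim_small (id int, attr string)
CLUSTERED BY (id) INTO 8 BUCKETS;

-- Big table: any multiple of 8 works, e.g. 32
CREATE TABLE fact_big (id int, val double)
CLUSTERED BY (id) INTO 32 BUCKETS;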

Input for a sort merge bucket map join in Hive

I would like to perform a sort-merge join as described in the Hive manual (Bucketed Map Join) using the following options
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
Both tables must be bucketed and sorted on the join column. My question is - does the sort have to be global, i.e. the keys in the first bucket are less than the keys in the second bucket, or is it sufficient that each bucket is sorted?
You must define the tables to be CLUSTERED BY the same column and SORTED BY the same column, in the same order, INTO the same number of buckets.
Then you must set the settings listed above AND write the hint /*+ MAPJOIN(x) */, where x is one of the tables.
Also, both tables must be joined as-is in the join clause; you can't use either of them in a sub-query before the join, because the data won't be bucketed and sorted after the sub-query, which happens first.
Finally, the join columns must be the ones the tables are bucketed/sorted on.
When you insert the data into the tables, you can either use the hive.enforce.sorting setting (set to true) or sort the data yourself (e.g. with SORT BY).
Hive doesn't check that the buckets are actually sorted, so if they aren't, this might cause wrong results in the output.
Each mapper will read a bucket from the first table and the corresponding bucket from the second, and it will perform a sort-merge join.
To your question: no, the keys don't have to be globally sorted; it is sufficient that each bucket is sorted.
P.S.
You should issue the EXPLAIN command before running the query, and you'll see whether Hive plans to do a sort-merge bucket join or not.
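A minimal sketch of what that looks like (table names are hypothetical; both tables are assumed to be CLUSTERED BY and SORTED BY the join column into the same number of buckets):

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

-- Check the plan first: look for a sorted-merge bucket map join operator.
EXPLAIN
SELECT /*+ MAPJOIN(b) */ a.key, a.val, b.val
FROM big_table a JOIN small_table b ON a.key = b.key;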

SQLite3 Database Query Optimization

I want a result combining 4 tables. Previously I was using 4 different queries; to improve performance, I started joining the tables and querying a single result set. But there was no improvement in performance.
I later learnt that SQLite translates join statements into a "WHERE clause", so I could use a "WHERE" clause directly instead of a join, which would save some CPU time.
But the problem with the "WHERE" clause is that if one condition out of the four fails, the result set is empty. I want rows with the rest of the columns (those matching the other conditions) filled, not an empty result, if one condition fails. Is there a way to achieve this? Thanks!
Have you considered using a LEFT OUTER JOIN?
For example:
SELECT Customers.AcctNumber, Customers.Custname, catalogsales.InvoiceNo
FROM Customers
LEFT OUTER JOIN catalogsales ON Customers.Acctnumber = catalogsales.AcctNumber
In this example, if there are no matching rows in "catalogsales", it will still return the data from the "left" table, which in this case is "Customers".
Without example SQL it's hard to know what you've tried.
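To extend the idea to the four tables in the question, you can chain LEFT OUTER JOINs so that a failed match in any one table leaves NULLs in its columns instead of dropping the row (table and column names here are hypothetical):

SELECT a.id, b.col1, c.col2, d.col3
FROM table_a a
LEFT OUTER JOIN table_b b ON b.a_id = a.id
LEFT OUTER JOIN table_c c ON c.a_id = a.id
LEFT OUTER JOIN table_d d ON d.a_id = a.id;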

Fetch data from multiple tables and sort all by their time

I'm creating a history page, so I was wondering: is there any way to fetch all rows from multiple tables and then sort them by their time? Every table has a field called "created_at".
So is there any way to fetch from all tables and sort in the database, without having Rails sort them for me?
You may get a better answer, but I would presume you would need to
Create a History table with a Created date column, an autogenerated Id column, and any other contents you would like to expose [eg Name, Description]
Modify all tables that generate a "history" item to consume this new table via Foreign Key relationship on History.Id
"Mashing up" tables [ie merging different result sets into a single result set] is a very difficult problem, but you would effectively be doing the above anyway - just in the application layer, so why not do it correctly and more efficiently in the data layer.
Hope this helps :)
You would need to perform SQL like:
SELECT * FROM table ORDER BY created_at ASC
Store the result in an array, do this for each of the data sources, and then perform a merge sort on all the arrays in Ruby. Of course this works well for small data sets, but once your data set gets large (i.e. bigger than will fit into memory) you will have to use a different collect/merge algorithm.
So I guess the answer is that you do need to do some sorting in Ruby, unless you resort to the UNION method described in another answer.
Depending on whether these databases are all on the same machine or not:
On the same machine: use ORDER BY and UNION in your SQL to return your result set.
On different machines: you'll want to test this for performance, but you could use linked servers and UNION ... ORDER BY. Alternatively, you could have Ruby get the results from each DB and then combine and sort them.
EDIT: From your last comment about different tables rather than different DBs, use something like this (UNION ALL keeps duplicate timestamps that plain UNION would silently drop):
SELECT created_at FROM table1
UNION ALL
SELECT created_at FROM table2
ORDER BY created_at
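Since a history page usually needs more than the timestamp, a slightly fuller sketch (table and column names are hypothetical) tags each row with its source so the merged feed stays identifiable:

SELECT 'post' AS source, id, created_at FROM posts
UNION ALL
SELECT 'comment' AS source, id, created_at FROM comments
ORDER BY created_at DESC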
