Hive Bucketed Map Join

I am having trouble getting a bucketed map join to execute.
I am using Hive 0.10.
Table1 is partitioned on year, month and day. Each partition's data is bucketed by column c1 into 128 buckets. I have almost 100 million records per day.
Table 1
create table table1
(
....
....
)
partitioned by (year int,month int,day int)
CLUSTERED BY(c1) INTO 128 BUCKETS;
Table2 is a large lookup table bucketed on column c1. I have 80 million records loaded into 128 buckets.
Table 2
create table table2
(
c1
c2
...
)
CLUSTERED BY(c1) INTO 128 BUCKETS;
I have checked the data, and it is loaded into the buckets as expected.
Now I am trying to force a bucketed map join, and that is where I am stuck.
set hive.auto.convert.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.mapjoin.bucket.cache.size=1000000;
select a.c1 as c1_tb2,a.c2,
b.c1,b....
from table2 a
JOIN table1 b
ON (a.c1=b.c1);
I am still not getting a bucketed map join. Am I missing something? I even tried running the join on only one partition, but I still get the same result.
Or does bucketed map join not work with partitioned tables?
Please help. Thanks.

This explanation is for Hive 0.13. AFAICT, bucketed map join doesn't take effect for auto-converted map joins. You need to request the map join explicitly with a hint, like this:
set hive.optimize.bucketmapjoin = true;
explain extended select /*+ MAPJOIN(b) */ count(*)
from nation_b1 a
join nation_b2 b on (a.n_regionkey = b.n_regionkey);
Note that only explain extended shows the flag that indicates whether bucket map join is being used. Look for this line in the plan:
BucketMapJoin: true

Tables are bucketed in Hive so that portions of the data can be managed and processed individually, which makes processing easier to manage and more efficient.
Let's look at how a join works when the data is stored in buckets.
Say there are two tables, user and user_visits, and both are bucketed on user_id into 4 buckets. Bucket 1 of user then contains rows with the same user ids as bucket 1 of user_visits. If the two tables are joined on user_id and bucket 1 of both tables can be sent to the same mapper, a good deal of work is saved. This is exactly what a bucketed map join does.
Prerequisites for bucket map join:
Tables being joined are bucketized on the join columns.
The number of buckets in one table is the same as, or a multiple of, the number of buckets in the other table.
If the tables being joined are bucketized on the join columns, the buckets can be joined with each other. If table A has 4 buckets and table B has 4 buckets, the following join
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key
can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched. For the query above, the mapper processing bucket 1 for A will only fetch bucket 1 of B. This is not the default behavior; it is governed by the following parameter:
set hive.optimize.bucketmapjoin = true
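The bucket-to-bucket idea described above can be sketched outside Hive. The following is a minimal Python illustration, not Hive's implementation: `bucket_of` is a simplified stand-in for Hive's bucketing function (assumed here to be the key modulo the bucket count), and all function names and sample rows are hypothetical.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count, matching the 4-bucket example


def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Simplified stand-in for a hash-mod-bucket-count assignment (an assumption).
    return key % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Split (key, value) rows into num_buckets lists by bucket_of(key)."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def bucket_map_join(a_buckets, b_buckets):
    """Join bucket i of A against a lookup built from bucket i of B only."""
    joined = []
    for a_bucket, b_bucket in zip(a_buckets, b_buckets):
        lookup = defaultdict(list)  # in-memory table for one bucket of B
        for key, value in b_bucket:
            lookup[key].append(value)
        for key, a_value in a_bucket:
            for b_value in lookup.get(key, []):
                joined.append((key, a_value, b_value))
    return joined


def plain_join(a_rows, b_rows):
    """Reference join that compares every row of A with every row of B."""
    return [(ka, va, vb) for ka, va in a_rows for kb, vb in b_rows if ka == kb]


a_rows = [(k, f"a{k}") for k in range(10)]
b_rows = [(k, f"b{k}") for k in range(0, 10, 2)]
bucketed = bucket_map_join(bucketize(a_rows), bucketize(b_rows))
```

Because both sides are split by the same function of the join key, matching rows always land in buckets with the same index, so the per-bucket join finds the same pairs as the full comparison.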
If the tables being joined are sorted and bucketized on the join columns, and they have the same number of buckets, a sort-merge join can be performed. The corresponding buckets are joined with each other at the mapper. If both A and B have 4 buckets,
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM A a JOIN B b ON a.key = b.key
can be done on the mapper only. The mapper for the bucket for A will traverse the corresponding bucket for B. This is not the default behavior, and the following parameters need to be set:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
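To illustrate why sorted buckets allow a merge instead of a lookup table, here is a minimal Python sketch of a merge join over one pair of corresponding buckets. It is only an illustration of the general sort-merge technique, not Hive's code; `merge_join` and the sample rows are hypothetical.

```python
def merge_join(a_sorted, b_sorted):
    """Join two lists of (key, value) pairs that are both sorted by key."""
    joined = []
    i = j = 0
    while i < len(a_sorted) and j < len(b_sorted):
        a_key, b_key = a_sorted[i][0], b_sorted[j][0]
        if a_key < b_key:
            i += 1  # A is behind, advance A
        elif a_key > b_key:
            j += 1  # B is behind, advance B
        else:
            # Find the run of equal keys on each side and emit every pairing.
            i_end = i
            while i_end < len(a_sorted) and a_sorted[i_end][0] == a_key:
                i_end += 1
            j_end = j
            while j_end < len(b_sorted) and b_sorted[j_end][0] == a_key:
                j_end += 1
            for _, a_value in a_sorted[i:i_end]:
                for _, b_value in b_sorted[j:j_end]:
                    joined.append((a_key, a_value, b_value))
            i, j = i_end, j_end
    return joined


# One bucket from each table, already sorted on the join key.
a_bucket = [(1, "a1"), (5, "a5"), (5, "a5x"), (9, "a9")]
b_bucket = [(5, "b5"), (9, "b9"), (13, "b13")]
merged = merge_join(a_bucket, b_bucket)
```

Each side is read once from start to end, which is why both tables need to be sorted on the join column and have matching buckets for this to work.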

Related

proc sql inner join behavior and required select statements

I recently started using SAS, only receiving a basic training that didn't cover proc sql. I'd like to read up a bit more on SAS sql when I have the time.
For now, I found a solution to what I wanted to do, but I'm having difficulties understanding what is happening.
My issue started when I wanted to find out which subjects in my dataset have a certain value for all their records. I made use of my previously written snippet of code that I thought I understood. I just tried adding a couple more variables and group by statements:
data have;
input subject:$1. myvar:1. mycount:1.;
datalines;
a 1 1
a 0 2
a 0 3
b 1 1
b 0 2
b 1 3
c 1 1
c 1 2 /*This subject has myvar = 1 for all its observations*/
;
run;
*find subjects;
proc sql;
create table want as
/* select*/
/* distinct x.subject */
/* from */
(select distinct subject, count(myvar) as myvar_c
from have where myvar = 1 group by subject) x,
(select distinct subject, max(mycount) as max_c
from have group by subject) y
where x.subject = y.subject and x.myvar_c = y.max_c;
quit;
When I uncomment the 'select distinct x.subject from' lines in the create table statement, the above code works as it should.
However, I've previously also created another piece of code, to select all subjects in my dataset that have two types of records:
data have2;
input subject:$1. mytype:1.;
datalines;
a 1
a 0
a 0
b 1
b 0
b 1
c 1
c 1 /*This subject doesn't have two types of records in all its observations*/
;
run;
*Find subjects;
proc sql;
create table want2 as select
distinct x.subject from
have2 x,
(select distinct subject, count(distinct mytype) as mytype_c from have2 group by subject) y
where y.mytype_c = 2 and x.subject = y.subject;
quit;
This is similar, but it didn't require the additional select statement. The first query needs three select statements; the second needs only two.
Can someone explain exactly why the extra select is required?
Or point me to good documentation that specifies how these joins work? Also, what is the name of this type of join, where the tables are separated only by a comma?
While writing this, I also realized I could have taken the code I originally wrote to find subjects with only one type of record and tweaked it for my current issue, but I would still like to know what is happening in the first example.

The SQL join construct
FROM ONE, TWO, THREE, …
is known as a CROSS JOIN and is a join without criteria. The comma (,) syntax is less prevalent today and the following construct is recommended
FROM ONE
CROSS JOIN TWO
CROSS JOIN THREE
The result set is a Cartesian product: the number of rows is the product of the row counts of the cross-joined tables.
When the query has criteria in a WHERE clause that relate the tables to each other, the result is equivalent to an INNER JOIN.
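As a quick check of the two statements above, here is a small sketch using Python's built-in sqlite3 module rather than SAS PROC SQL (an assumption made for the sake of a runnable example; the tables `one` and `two` and their columns are made up).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE one (id INTEGER, label TEXT);
    CREATE TABLE two (id INTEGER, colour TEXT);
    INSERT INTO one VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO two VALUES (1, 'red'), (3, 'blue');
""")

# Comma join without criteria: every row of one paired with every row of two.
cross_rows = conn.execute("SELECT * FROM one, two").fetchall()

# Comma join with a WHERE clause relating the two tables.
comma_rows = conn.execute(
    "SELECT one.id, label, colour FROM one, two "
    "WHERE one.id = two.id ORDER BY one.id"
).fetchall()

# The same query written with explicit INNER JOIN ... ON syntax.
inner_rows = conn.execute(
    "SELECT one.id, label, colour FROM one INNER JOIN two "
    "ON one.id = two.id ORDER BY one.id"
).fetchall()
```

Here `cross_rows` has 3 x 2 rows, while `comma_rows` and `inner_rows` contain the same matched rows.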
The SAS documentation for Proc SQL is a good starting point and includes examples.
joined-table Component
Joins a table with itself or with other tables or views.
…
Table of Contents
Syntax
Required Arguments
Optional Argument
Details
Types of Joins
Joining Tables
Table Limit
Specifying the Rows to Be Returned
Table Aliases
Joining a Table with Itself
Inner Joins
Outer Joins
Cross Joins
Union Joins
Natural Joins
Joining More Than Two Tables
Comparison of Joins and Subqueries
General tip:
If you want to fiddle with SQL queries in a browser, try the SQL Fiddle web site.

create new field based on multiple resident tables

Given multiple in-resident tables, I'd like to create a new field based on fields in different tables.
table1:
LOAD * INLINE [
id1,val1
a1,car1
a2,car2
];
table2:
LOAD * INLINE [
id2,id1,val2
b1,a1,type1
b2,a2,type2
];
table3:
LOAD * INLINE [
id3,id2,val3
c1,b1,mfr1
c2,b2,mfr2
];
For the sake of argument, assume table1 has ~1M rows, table2 ~1K rows, and table3 ~10 rows. I'd like to create a new field that is either added to table1 or perhaps in a new table linked by id1, resulting in:
id1 val1 newval
a1 car1 car1type1mfr1
a2 car2 car2type2mfr2
Efforts:
newtable:
load val1 & val2 & val3 as newval;
No errors but no newtable or newval.
newtable:
left join (table2)
load val1&val2 as newval resident table1;
Errs with Field not found - <val2>. (Obviously I want to extend this to include table3, but if I can't do it with 2 tables then 3 just won't work.)
The real data includes seven tables for this new field (lots of foreign keys). The data is being loaded from QVDs (the data is shared across multiple QVWs), closely mimicking a SQL database; none of the tables are row-wise redundant, so combining db tables into a single QVD table may be inefficient. (Plus refreshing the data is incredibly easier one table at a time.) A colleague suggested I load-join each of the QVDs into one huge table, but that doesn't seem right (nor have I successfully chain-joined even a few tables).
Using QV 12.0 desktop on win10-x64 for deployment on QVS.

@TheBudac's answer was part of the way there, but it only merged two of the three tables. Most of my problems stemmed from incorrect multi-table joins. My confusion was about the "join" syntax in Qlik; the docs make sense to me now that I see what's happening, but it wasn't obvious to me initially.
Here's what eventually worked best for me:
temptable:
load id1 as id1a, val1 as val1a
resident table1;
left join (temptable)
load id2 as id2a, id1 as id1a, val2 as val2a
resident table2;
left join (temptable)
load id2 as id2a, val3 as val3a
resident table3;
newtable:
load id1a as id1,
val1a & val2a & val3a as newval
resident temptable;
drop table temptable;
Quick walk-through:
Because I'm using left join, I start with the largest table; other join types would call for a different starting table. In my case, table1 is the largest, so I start with that:
temptable:
load id1 as id1a, val1 as val1a
resident table1;
Each join should be against the temporary table we're working on. Renaming variables is important so that Qlik doesn't create unnecessary synthetic keys.
left join (temptable)
load id2 as id2a, id1 as id1a, val2 as val2a
resident table2;
The use of resident is important in that it does not re-query (SQL) or re-load (QVD or other file).
Repeat with the third and further tables, always joining against temptable with the new table.
Now we use that temporary table to create our new table. You can choose to augment table1 with this data instead (certainly feasible), but for me since I'm generating several new calculated fields (not shown here), it made sense to keep them logically separated.
newtable:
load id1a as id1,
val1a & val2a & val3a as newval
resident temptable;
drop table temptable;
Note that I rename the relevant key back to its original name so that this table correctly links to table1. Dropping the temporary table helps clean things up, but it does no harm to keep it around (and doing so helps in debugging/learning).
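For readers who want to see the shape of the data at each step, the same chain of left joins followed by a concatenation can be mimicked in plain Python. This is only a sketch of the logic, not Qlik script: the `left_join` helper is hypothetical, and the sample rows use values consistent with the expected result in the question.

```python
table1 = [{"id1": "a1", "val1": "car1"}, {"id1": "a2", "val1": "car2"}]
table2 = [
    {"id2": "b1", "id1": "a1", "val2": "type1"},
    {"id2": "b2", "id1": "a2", "val2": "type2"},
]
table3 = [
    {"id3": "c1", "id2": "b1", "val3": "mfr1"},
    {"id3": "c2", "id2": "b2", "val3": "mfr2"},
]


def left_join(left_rows, right_rows, key):
    """Left join two lists of dicts on a shared field name."""
    by_key = {}
    for row in right_rows:
        by_key.setdefault(row[key], []).append(row)
    result = []
    for left in left_rows:
        matches = by_key.get(left.get(key))
        if not matches:
            result.append(dict(left))  # keep unmatched left rows as they are
            continue
        for right in matches:
            result.append({**left, **right})
    return result


# Step 1 and 2: start from the largest table and join the others onto it.
temptable = left_join(table1, table2, "id1")
temptable = left_join(temptable, table3, "id2")

# Step 3: only after the joins are all fields available on one row.
newtable = [
    {"id1": row["id1"],
     "newval": row["val1"] + row.get("val2", "") + row.get("val3", "")}
    for row in temptable
]
```

The point it illustrates is the ordering: the concatenation can only be computed once val1, val2 and val3 sit on the same row, which is why the joins come first and the calculated field comes from a second pass.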

Your join is the wrong way round, and QlikView can only work on results after they have been joined, not during the join, so you will have to do another resident load to concatenate the values into NewVal. The drop table commands are important, or you will end up with massive unintentional synthetic-key tables.
newtable:
left join (table1)
load * resident table2; drop table table2;
Resulttable:
load id1,
val1&val2 as NewVal
resident newtable; drop table newtable;

Redshift - Efficient JOIN clause with OR

I have the need to join a huge table (10 million plus rows) to a lookup table (15k plus rows) with an OR condition. Something like:
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2 ON t1.c = t2.c OR t1.d = t2.d;
This is because table1 can have c or d as NULL, and I'd like to join on whichever is available, leaving out the rest. The query plan says there is a Nested Loop, which I realize is because of the OR condition. Is there a clean, efficient way of solving this problem? I'm using Redshift.
EDIT: I am trying to run this with a UNION, but it doesn't seem to be any faster than before.

If you have a preferred column you can NVL() (aka COALESCE()) them and join on that.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2
ON t1.c = NVL(t2.c,t2.d);
I'd also suggest that you should set the lookup table to DISTSTYLE ALL to ensure that the larger table is not redistributed.
[ Also, 10 million rows isn't big for Redshift. Not trying to be snotty just saying that we get excellent performance on Redshift even when querying (and joining) tables with hundreds of billions of rows. ]

How about doing two (left) joins? With such a small lookup table, performance shouldn't be too bad.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t3.d)
FROM table1 t1
LEFT JOIN table2 t2 ON t1.d = t2.d and t1.c is null
LEFT JOIN table2 t3 ON t1.c = t3.c and t1.d is null
Your original query only returns rows that match at least one of c or d in the lookup table. If a match is not guaranteed, you may need to add filters, for example to exclude rows in t1 where both c and d are null or have values not present in table2.
The null checks in the joins aren't strictly needed, but they might make the query slightly faster.
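The rewrite can be sanity-checked on a toy dataset. The sketch below uses Python's sqlite3 module instead of Redshift, with COALESCE in place of nvl, and assumes that every row of table1 has exactly one of c or d set and that it matches the lookup table. The sample rows are made up, and the sketch says nothing about Redshift's query plan or performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (a INTEGER, b TEXT, c TEXT, d TEXT);
    CREATE TABLE table2 (c TEXT, d TEXT);
    INSERT INTO table2 VALUES ('c1', 'd1'), ('c2', 'd2'), ('c3', 'd3');
    INSERT INTO table1 VALUES
        (1, 'x', 'c1', NULL),
        (2, 'y', NULL, 'd2'),
        (3, 'z', 'c3', NULL);
""")

# Original form: a single join with an OR condition.
or_rows = conn.execute("""
    SELECT t1.a, t1.b, COALESCE(t1.c, t2.c), COALESCE(t1.d, t2.d)
    FROM table1 t1
    JOIN table2 t2 ON t1.c = t2.c OR t1.d = t2.d
    ORDER BY t1.a
""").fetchall()

# Rewritten form: two left joins, each with a plain equality condition.
two_join_rows = conn.execute("""
    SELECT t1.a, t1.b, COALESCE(t1.c, t2.c), COALESCE(t1.d, t3.d)
    FROM table1 t1
    LEFT JOIN table2 t2 ON t1.d = t2.d AND t1.c IS NULL
    LEFT JOIN table2 t3 ON t1.c = t3.c AND t1.d IS NULL
    ORDER BY t1.a
""").fetchall()
```

On this data both queries return the same three rows. With rows in table1 that match nothing, the left-join version would keep them with nulls while the OR version would drop them, which is the filtering caveat mentioned above.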

Issues with joining multiple tables

I'm really struggling at the moment trying to work out how to join multiple tables without duplicating data.
At the moment I have 8 tables from which I want to get various pieces of information per member of staff, like the below:
SDQ score, Goal scores, CHI score, number of appointments, number of dna appointments
The tables and fields I can see to join on are as follows
tblSDQ - Assessed_By_Staff_ID
tblGoals - Recorded_By_Staff_ID
tblCHI - Recorded_By_Staff_ID
tblReferral - Staff_ID
tblStaff - Staff_ID
tblDiaryAppointment - needs to connect to tblDiaryAppointmentClinician using Clinician_Invitee_Staff_ID
I hope someone can help or advise. I just don't know if it's even possible to join all these tables using the same field, or if it's possible to join them so that some entries are returned as rows while others are just counted.

The syntax depends on the RDBMS you are using.
You can use a join that names the join fields from both tables:
select bla-bla
from table1
join table2 on ( table1.field_name1 = table2.field_name2 )
https://dev.mysql.com/doc/refman/5.0/en/join.html
If you need an outer join (to show nulls where the optional table has no data), use a left join:
left join table2 on ( table1.field_name1 = table2.field_name2 )
To join with counts, you can use subqueries like this:
join ( select field_name3, count(*) as cnt from table3 group by field_name3 ) AS table3_counts
...
where ( table3_counts.field_name3 = ... or table3_counts.field_name3 is null )
https://dev.mysql.com/doc/refman/5.0/en/from-clause-subqueries.html
PS: Joins can be slow on large tables. Where that becomes a problem, you can denormalize the tables to eliminate joins, or run simple selects and join in backend code.
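The duplication the question describes, and the count-subquery approach, can be seen on a toy dataset. The sketch below uses Python's sqlite3 module; the table names and staff id columns come from the question, but the sample rows are made up and only two of the tables are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblStaff (Staff_ID INTEGER);
    CREATE TABLE tblSDQ (Assessed_By_Staff_ID INTEGER);
    CREATE TABLE tblGoals (Recorded_By_Staff_ID INTEGER);
    INSERT INTO tblStaff VALUES (1), (2);
    INSERT INTO tblSDQ VALUES (1), (1), (1);
    INSERT INTO tblGoals VALUES (1), (1), (2);
""")

# Joining the detail tables directly multiplies rows before counting.
naive = conn.execute("""
    SELECT s.Staff_ID,
           COUNT(q.Assessed_By_Staff_ID),
           COUNT(g.Recorded_By_Staff_ID)
    FROM tblStaff s
    LEFT JOIN tblSDQ q ON q.Assessed_By_Staff_ID = s.Staff_ID
    LEFT JOIN tblGoals g ON g.Recorded_By_Staff_ID = s.Staff_ID
    GROUP BY s.Staff_ID
    ORDER BY s.Staff_ID
""").fetchall()

# Counting inside subqueries first gives one row per staff member to join.
pre_aggregated = conn.execute("""
    SELECT s.Staff_ID, COALESCE(q.cnt, 0), COALESCE(g.cnt, 0)
    FROM tblStaff s
    LEFT JOIN (SELECT Assessed_By_Staff_ID AS Staff_ID, COUNT(*) AS cnt
               FROM tblSDQ GROUP BY Assessed_By_Staff_ID) q
           ON q.Staff_ID = s.Staff_ID
    LEFT JOIN (SELECT Recorded_By_Staff_ID AS Staff_ID, COUNT(*) AS cnt
               FROM tblGoals GROUP BY Recorded_By_Staff_ID) g
           ON g.Staff_ID = s.Staff_ID
    ORDER BY s.Staff_ID
""").fetchall()
```

For staff member 1, the direct join pairs each of the 3 SDQ rows with each of the 2 goal rows, so both counts come out as 6; the pre-aggregated version reports 3 and 2.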

Hive join query returning Cartesian product on inner join

I am doing an inner join on two tables created using Hive. One is a big table, "trades_bucket", and the other is a small table, "counterparty_bucket". They are created as follows:
DROP TABLE IF EXISTS trades_bucket;
CREATE EXTERNAL TABLE trades_bucket(
parentId STRING,
BookId STRING) CLUSTERED BY(parentId) SORTED BY(parentId) INTO 32 BUCKETS;
DROP TABLE IF EXISTS counterparty_bucket;
CREATE EXTERNAL TABLE counterparty_bucket(
Version STRING,AccountId STRING,childId STRING)
CLUSTERED BY(childId ) SORTED BY(childId) INTO 32 BUCKETS;
The Join between the tables
SELECT /*+ MAPJOIN(counterparty_bucket) */ BookId , t.counterpartysdsid, c.sds
FROM counterparty_bucket c join trades_bucket t
on c.childId = t.parentId
where c.childId ='10001684'
The problem is that the join is producing a Cartesian product of the two tables. What I mean is, if the big table has 100 rows and the small table has 4 rows for a given id, I expect the join to return 100 rows, but I am getting back 400 rows. Does anyone have a clue, or has anyone seen a similar situation?
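For reference, the row counts described above can be reproduced with standard SQL inner-join semantics, where each matching pair of rows produces one output row. The sketch below uses Python's sqlite3 module rather than Hive, with made-up rows in tables named after the ones in the question; whether this is what is happening in the Hive query is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades_bucket (parentId TEXT, BookId TEXT)")
conn.execute(
    "CREATE TABLE counterparty_bucket (Version TEXT, AccountId TEXT, childId TEXT)"
)
# 100 rows in the big table and 4 rows in the small table for the same id.
conn.executemany(
    "INSERT INTO trades_bucket VALUES (?, ?)",
    [("10001684", f"book{i}") for i in range(100)],
)
conn.executemany(
    "INSERT INTO counterparty_bucket VALUES (?, ?, ?)",
    [(f"v{i}", "acct", "10001684") for i in range(4)],
)

# Plain inner join: every small-table row pairs with every big-table row.
joined_count = conn.execute("""
    SELECT COUNT(*)
    FROM counterparty_bucket c JOIN trades_bucket t ON c.childId = t.parentId
    WHERE c.childId = '10001684'
""").fetchone()[0]

# Reducing the small side to one row per id before joining.
deduplicated_count = conn.execute("""
    SELECT COUNT(*)
    FROM (SELECT DISTINCT childId FROM counterparty_bucket) c
    JOIN trades_bucket t ON c.childId = t.parentId
    WHERE c.childId = '10001684'
""").fetchone()[0]
```

In this sketch the plain join returns 400 rows and the de-duplicated version returns 100, so a result of 100 would require the small table to have a single row per join key.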
