Hive query, efficient non equi join?

Hive query, efficient non equi join? - join

I have two tables.
tableOne contains
userid
gameid
starttimestamp
endtimestamp
tableTwo contains
userid
actiontimestamp
someaction
Given the userid and gameid, I want to see how many actions there were in each game id. Given only equi join is allowed, what's a efficient way to join them together?
Most of my crossjoin and filter attempts ended up mapper and reducer getting stuck at 100%.

You could handle all your "theta join" (non-equijoin) conditions in the WHERE clause. Like this:
SELECT * FROM OrderLineItem li LEFT OUTER JOIN ProductPrice p ON p.ProductID = li.ProductID
WHERE (p.StartDate IS NULL AND p.EndDate IS NULL)
OR li.OrderDate BETWEEN p.StartDate AND p.EndDate;
Of course, this example assumes StartDate and EndDate are both non-nullable columns of ProductPrice.

Non equi joins are not available in Hive.
For optimizing equi joins you can try the following.
1.You can implement Buckets in Hive.
2.Read this facebook article also.
3.Do you have more than one job ?.if yes ,enable parallel execution in hive.
if your jobs are indepenent they run parallel.
4.If one of the tables is small ,use distributed cache with add file option in hive.

Related

How to fix DF-JOIN-002 Error in Azure Data Factory (Only two Join conditions allowed)

I have a data flow with a Union on two tables then joining the results of the Union to another table. I keep receiving the following error when I try debugging the pipeline or previewing the data.
DF-JOIN-002 at Join 'Join1'(Line 40/Col 26): Only 2 join condition(s) allowed
I'm basically trying to build a pipeline to automate this query:
SELECT DISTINCT k.acct_id, s.Id, Email, FirstName, LastName FROM table_3 s
INNER JOIN
( (SELECT acct_id, event_date FROM table_1)
UNION (SELECT acct_id, event_date FROM table_2)) k
ON k.acct_id = s.Archtics_acct_id__c
WHERE event_date = 'xxxx-xx-xx'
enter image description here

I figured it out after some time. I had to delete the Join activity and add it again. It was still linked to another source. Even though only two sources were selected in the join settings

Why does Hive warn that this subquery would cause a Cartesian product?

According to Hive's documentation it supports NOT IN subqueries in a WHERE clause, provided that the subquery is an uncorrelated subquery (does not reference columns from the main query).
However, when I attempt to run the trivial query below, I get an error FAILED: SemanticException Cartesian products are disabled for safety reasons.
-- sample data
CREATE TEMPORARY TABLE foods (name STRING);
CREATE TEMPORARY TABLE vegetables (name STRING);
INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
-- the problematic query
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables)
Note that if I use an IN clause instead of a NOT IN clause, it actually works fine, which is perplexing because the query evaluation structure should be the same in either case.
Is there a workaround for this, or another way to filter values from a query based on their presence in another table?
This is Hive 2.3.4 btw, running on an Amazon EMR cluster.

Not sure why you would get that error. One work around is to use not exists.
SELECT f.*
FROM foods f
WHERE NOT EXISTS (SELECT 1
FROM vegetables v
WHERE v.name = f.name)
or a left join
SELECT f.*
FROM foods f
LEFT JOIN vegetables v ON v.name = f.name
WHERE v.name is NULL

You got cartesian join because this is what Hive does in this case. vegetables table is very small (just one row) and it is being broadcasted to perform the cross (most probably map-join, check the plan) join. Hive does cross (map) join first and then applies filter. Explicit left join syntax with filter as #VamsiPrabhala said will force to perform left join, but in this case it works the same, because the table is very small and CROSS JOIN does not multiply rows.
Execute EXPLAIN on your query and you will see what is exactly happening.

Unusual Joins SQL

I am having to convert code written by a former employee to work in a new database. In doing so I came across some joins I have never seen and do not fully understand how they work or if there is a need for them to be done in this fashion.
The joins look like this:
From Table A
Join(Table B
Join Table C
on B.Field1 = C.Field1)
On A.Field1 = B.Field1
Does this code function differently from something like this:
From Table A
Join Table B
On A.Field1 = B.Field1
Join Table C
On B.Field1 = C.Field1
If there is a difference please explain the purpose of the first set of code.
All of this is done in SQL Server 2012. Thanks in advance for any help you can provide.

I could create a temp table and then join that. But why use up the cycles\RAM on additional storage and indexes if I can just do it on the fly?
I ran across this scenario today in SSRS - a user wanted to see all the Individuals granted access through an AD group. The user was using a cursor and some temp tables to get the users out of AD and then joining the user to each SSRS object (Folders, reports, linked reports) associated with the AD group. I simplified the whole thing with Cross Apply and a sub query.
GroupMembers table
GroupName
UserID
UserName
AccountType
AccountTypeDesc
SSRSOjbects_Permissions table
Path
PathType
RoleName
RoleDesc
Name (AD group name)
The query needs to return each individual in an AD group associated with each report. Basically a Cartesian product of users to reports within a subset of data. The easiest way to do this looks like this:
select
G.GroupName, G.UserID, G.Name, G.AccountType, G.AccountTypeDesc,
[Path], PathType, RoleName, RoleDesc
from
GroupMembers G
cross apply
(select
[Path], PathType, RoleName, RoleDesc
from
SSRSOjbects_Permissions
where
Name = G.GroupName) S;
You could achieve this with a temp table and some outer joins, but why waste system resources?

I saw this kind of joins - it's MS Access style for handling multi-table joins. In MS Access you need to nest each subsequent join statement into its level brackets. So, for example this T-SQL join:
SELECT a.columna, b.columnb, c.columnc
FROM tablea AS a
LEFT JOIN tableb AS b ON a.id = b.id
LEFT JOIN tablec AS c ON a.id = c.id
you should convert to this:
SELECT a.columna, b.columnb, c.columnc
FROM ((tablea AS a) LEFT JOIN tableb AS b ON a.id = b.id) LEFT JOIN tablec AS c ON a.id = c.id
So, yes, I believe you are right in your assumption

Alternative way of joining two datasets in SAS

I have two datasets DS1 and DS2. DS1 is 100,000rows x 40cols, DS2 is 20,000rows x 20cols. I actually need to pull COL1 from DS1 if some fields match DS2.
Since I am very-very new to SAS, I am trying to stick to SQL logic.
So basically I did (shot version)
proc sql;
...
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
on DS1.COL2=DS2.COL3
OR DS1.COL3=DS2.COL3
OR DS1.COL4=DS2.COL2
...
After an hour or so, it was still running, but I was getting emails from SAS that I am using 700gb or so. Is there a better and faster SAS-way of doing this operation?

I would use 3 separate queries and use a UNION
proc sql;
...
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
on DS1.COL2=DS2.COL3
UNION
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
On DS1.COL3=DS2.COL3
UNION
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
ON DS1.COL4=DS2.COL2
...

You may have null or blank values in the columns you are joining on. Your query is probably matching all the null/blank columns together resulting in a very large result set.
I suggest adding additional clauses to exclude null results.
Also - if the same row happens to exist in both tables, then you should also prevent the row from joining to itself.
Either of these could effectively result in a cartesian product join (or something close to a cartesian product join).
EDIT : By the way - a good way of debugging this type of problem is to limit both datasets to a certain number of rows - say 100 in each - and then running it and checking the output to make sure it's expected. You can do this using the SQL options inobs=, outobs=, and loops=. Here's a link to the documentation.

First sort the datasets that you are trying to merge using proc sort. Then merge the datasets based on id.
Here is how you can do it.
I have assumed you match field as ID
proc sort data=DS1;
by ID;
proc sort data=DS2;
by ID;
data out;
merge DS1 DS2;
by ID;
run;
You can use proc sort for Ds3 and DS4 and then include them in merge statement if you need to join them as well.

JPQL join two entities with no direct relations

I have an issue: When I am trying to join two tables which do not have a foreign key or a direct entity relation through my java code within themselves. I am using the below JPQL query: -
SELECT p FROM P p, OM orgm WHERE p.o.id = orgm.o.id and p.u.id = orgm.u.id and orgm.ma = true and p.u.id = ? AND p.o.id IN (:oId);
But this turns to a MySQL query which has a "cross join" which obviously is expensive.
What I need is to make sure that a similar query gives me an inner join MySQL query between the two tables.
I am trying to make usage of the "WITH" clause but seems that it doesn't work with inner join.
Please revert what can be done in this scenario.
Thanks in advance.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Hive query, efficient non equi join? - join

Related

How to fix DF-JOIN-002 Error in Azure Data Factory (Only two Join conditions allowed)

Why does Hive warn that this subquery would cause a Cartesian product?

Unusual Joins SQL

Alternative way of joining two datasets in SAS

JPQL join two entities with no direct relations

Categories

Resources