Apache Spark SQL: Automatic Inner Join?

So I have a weird situation.
Whenever I run sqlContext.sql with an inner join statement, I actually get an error, but when I read the error, it looks like Spark has already automatically joined my two separate tables by the time it tries to execute the ON clause.
Table1:
patient_id, code
Table2:
patient_id, date
Select code, date
from Table1
inner join Table2
on Table1.patient_id = Table2.patient_id <- the exception shows the tables are already joined by this point.
Any ideas about this behavior?
The error looks something like this:
org.apache.spark.sql.AnalysisException: cannot resolve 'Table2.patient_id' given input columns [patient_id, code, date]

I think that you have a typo in your program.
However, what you can do is the following:
tableOneDF.join(tableTwoDF, tableOneDF("patient_id") === tableTwoDF("patient_id"), "inner")
  .select("code", "date")
where tableOneDF and tableTwoDF are the two DataFrames created on top of the two tables.
Just try it out, and see if it still happens.
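If you want to stay with sqlContext.sql, a minimal sketch with table aliases (assuming Table1 and Table2 are registered as temporary tables under those names); qualifying every column through an alias makes it obvious which relation the analyzer is resolving against:

SELECT t1.code, t2.date
FROM Table1 t1
INNER JOIN Table2 t2
  ON t1.patient_id = t2.patient_id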

Related

Why does Hive warn that this subquery would cause a Cartesian product?

According to Hive's documentation it supports NOT IN subqueries in a WHERE clause, provided that the subquery is an uncorrelated subquery (does not reference columns from the main query).
However, when I attempt to run the trivial query below, I get an error FAILED: SemanticException Cartesian products are disabled for safety reasons.
-- sample data
CREATE TEMPORARY TABLE foods (name STRING);
CREATE TEMPORARY TABLE vegetables (name STRING);
INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
-- the problematic query
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables)
Note that if I use an IN clause instead of a NOT IN clause, it actually works fine, which is perplexing because the query evaluation structure should be the same in either case.
Is there a workaround for this, or another way to filter values from a query based on their presence in another table?
This is Hive 2.3.4 btw, running on an Amazon EMR cluster.
Not sure why you would get that error. One workaround is to use NOT EXISTS:
SELECT f.*
FROM foods f
WHERE NOT EXISTS (SELECT 1
                  FROM vegetables v
                  WHERE v.name = f.name)
or a LEFT JOIN:
SELECT f.*
FROM foods f
LEFT JOIN vegetables v ON v.name = f.name
WHERE v.name IS NULL
You got a cartesian join because this is what Hive does in this case. The vegetables table is very small, so it is being broadcast to perform the cross join (most probably a map join; check the plan). Hive does the cross (map) join first and then applies the filter. The explicit LEFT JOIN syntax with a filter, as #VamsiPrabhala said, will force a left join, but in this case it works out the same, because the broadcast side is tiny and the cross join does not multiply rows.
Execute EXPLAIN on your query and you will see exactly what is happening.
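For example, a minimal sketch (the same query, just prefixed with EXPLAIN):

EXPLAIN
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables);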

Suspected alias issue in bigquery join

I am relatively new to BigQuery and think I have an aliasing problem, but I can't work out what it is. Essentially, I have two tables; the first table has the majority of the required information, and the second has a date of birth that I need to join in. I have written the query below, and the two initial SELECT statements work in isolation and appear to return the expected values. However, when attempting to join the two tables I get an error stating:
Unrecognized name: t1_teams at [10:60]
WITH table_1 AS (
  SELECT competition_name, stat_season_name, matchdata_Date,
         t1_teams.name, t1_players.Position,
         CAST(REGEXP_REPLACE(t1_players.uID, r'[a-zA-Z]', '') AS NUMERIC) AS Player_ID1,
         t1_players.First, t1_players.Last
  FROM `prod.feed1`,
       UNNEST(teams) AS t1_teams,
       UNNEST(t1_teams.Players) AS t1_players),
table_2 AS (
  SELECT t2_players.uID AS Player_ID2, t2_players.stat_birth_date
  FROM `prod.feed2`,
       UNNEST(players) AS t2_players)
SELECT competition_name, stat_season_name, matchdata_Date,
       t1_teams.name, t1_players.Position, t1_players.uID,
       t1_players.First, t1_players.Last, t2_players.stat_birth_date
FROM table_1
LEFT JOIN table_2
  ON Player_ID1 = Player_ID2
WHERE competition_name = "EPL"
  AND stat_season_name = "Season 2018/2019"
Any help in steering me in the right direction would be greatly appreciated, as my reading of the BigQuery documentation and other searches have drawn a blank.
The problem is here:
WITH table_1 AS (
SELECT
competition_name,
stat_season_name,
matchdata_Date,
-- this line
t1_teams.name,
...
You're selecting t1_teams.name, so you end up with just name as an output column from the select list. If you want to refer to t1_teams later, then select that instead:
WITH table_1 AS (
SELECT
competition_name,
stat_season_name,
matchdata_Date,
-- this line
t1_teams,
...
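Alternatively (a sketch, not from the original answer): keep the CTE as it was and make the outer query reference only the columns table_1 and table_2 actually expose; the same "unrecognized name" problem would otherwise hit t1_players and t2_players next, since the CTEs only expose the columns and aliases in their select lists:

SELECT competition_name, stat_season_name, matchdata_Date,
       name, Position, Player_ID1, First, Last, stat_birth_date
FROM table_1
LEFT JOIN table_2
  ON Player_ID1 = Player_ID2
WHERE competition_name = "EPL"
  AND stat_season_name = "Season 2018/2019"

One caveat: Player_ID1 is CAST to NUMERIC while Player_ID2 is the raw uID string, so the join may also need a matching CAST on the table_2 side.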

SQL Multiple Joins on same table with where clause

Folks,
I've had a pretty thorough search before posting and couldn't see this answered anywhere previously. Perhaps it isn't possible... I'm using SQL Server 2008 R2.
Anyway, thanks in advance for looking/helping.
I have two tables that I'd like to join.
Table1 (t1):
Account------Name--------Amount
12345-------account1-----10000.00
12346-------account2-----20000.00
Table2 (t2):
ID-----Account---extraData
10-----12345-----ZZ100
20-----12345-----ZZ250
30-----12345-----ZZ400
10-----12346-----ZZ150
20-----12346-----ZZ200
I'm trying to return the following from the above tables:
t1.Account---t1.Name------ID1(t2.ID=10)---ID2(t2.ID=20)----SUM(Amount)
12345--------account1-------ZZ100------------ZZ250-------------10000.00
12346--------account2-------ZZ150------------ZZ200-------------20000.00
I have tried various joins of sorts and a union, but can't seem to get the results above. Most attempts result in either nothing, or the Amount column coming back as double the required result.
My starting point is:
Select t1.Account, t1.Name, t2A.extraData, t2B.extraData, SUM(t1.AMOUNT)
from table1 t1
join table2 t2A on t1.Account = t2A.Account and t2A.ID = '10'
join table2 t2B on t1.Account = t2B.Account and t2B.ID = '20'
Group by t1.Account, t1.Name, t2A.extraData, t2B.extraData
I've reduced the code and complexity of the query for this thread, but the problem is as above. I have no control over the table structure as they form part of an accounting system that I can't amend (I could, but I'd upset one or two people!).
Hopefully I've explained the issue clearly enough. It seems like it should be simple, but I can't seem to fathom it - perhaps I've just been staring too long. Anyway, thanks in advance for your assistance.
Edit: changed the code to reflect the first response, which highlighted a mistake in my posting.
Please try this; I think it will help you achieve your result.
DECLARE @ids VARCHAR(MAX)

-- build a comma-separated, bracketed list of the distinct IDs, e.g. [10], [20], [30]
SELECT @ids = STUFF((SELECT DISTINCT ', [' + CAST(ID AS VARCHAR(10)) + ']'
                     FROM t2
                     FOR XML PATH(''), TYPE)
                    .value('.', 'NVARCHAR(MAX)'), 1, 2, ' ')
SELECT @ids

-- pivot extraData into one column per ID, keeping the amount per account
EXECUTE ('SELECT
    Account, Name, ' + @ids + ', Amount
FROM
    (SELECT t1.Account, Name, ID, ExtraData, SUM(Amount) AS Amount
     FROM t1 INNER JOIN t2 ON t1.Account = t2.Account
     GROUP BY t1.Account, Name, ID, ExtraData) AS SourceTable
PIVOT
(
    MAX(ExtraData)
    FOR ID IN (' + @ids + ')
) AS PivotTable;')
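If only IDs 10 and 20 are ever needed, a simpler alternative to the dynamic pivot (a sketch, not part of the original answer) is conditional aggregation. Using MAX(t1.Amount) also sidesteps the doubling described in the question, because t1.Amount repeats once per matching t2 row and must not be summed across them:

SELECT t1.Account,
       t1.Name,
       MAX(CASE WHEN t2.ID = 10 THEN t2.extraData END) AS ID1,
       MAX(CASE WHEN t2.ID = 20 THEN t2.extraData END) AS ID2,
       MAX(t1.Amount) AS Amount
FROM t1
INNER JOIN t2 ON t2.Account = t1.Account
GROUP BY t1.Account, t1.Name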

Update with wrong JOIN in SQLAnywhere

I'm working on a project that uses SQLAnywhere and I found this query:
update MY_TABLE table1
set table1.column1 = table3.id
from MY_TABLE table2, MY_OTHER_TABLE table3
where table2.some_col = table3.some_col and table2.other_col is null;
The problem is that table1, which is updated, and table2/table3, which are joined, do not have any link between them; there is no constraint. Table1 is completely independent of the other two.
So as far as I can understand it, if the condition in the last line is met for at least one row, then ALL rows of table1 will be updated, because the join condition is then always true.
Am I right or am I missing something?
Short answer: Yes. I agree ;)
Long answer: It really looks like there is no connection between table1 and tables 2 and 3, so based on your input above, I'd expect the behaviour you describe.
I'd also remove the implicit JOIN here, as it might cause confusion.
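A sketch of the same statement with explicit JOIN syntax (same semantics, assuming SQL Anywhere's UPDATE ... FROM form); written this way, it is obvious that table1 appears with no join condition at all:

UPDATE MY_TABLE table1
SET table1.column1 = table3.id
FROM MY_TABLE table2
     JOIN MY_OTHER_TABLE table3 ON table2.some_col = table3.some_col
WHERE table2.other_col IS NULL;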

using SQL aggregate functions with JOINs

I have two tables - tool_downloads and tool_configurations. I am trying to retrieve the most recent build date for each tool in my database. The layout of the DB is simple. One table called tool_downloads keeps track of when a tool is downloaded. Another table is called tool_configurations and stores the actual data about the tool. They are linked together by the tool_conf_id.
If I run the following query which omits dates, I get back 200 records.
SELECT DISTINCT a.tool_conf_id, b.tool_conf_id
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
ORDER BY a.tool_conf_id
When I try to add in date information I get back hundreds of thousands of records! Here is the query that fails horribly.
SELECT DISTINCT a.tool_conf_id, max(a.configured_date) as config_date, b.configuration_name
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
ORDER BY a.tool_conf_id
I know the problem has something to do with group-bys/aggregate data and joins. I can't really search google since I don't know the name of the problem I'm encountering. Any help would be appreciated.
The solution is:
SELECT b.tool_conf_id, b.configuration_name, max(a.configured_date) as config_date
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
GROUP BY b.tool_conf_id, b.configuration_name
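Why this works: MAX is an aggregate, so every non-aggregated column in the SELECT list (tool_conf_id, configuration_name) must appear in a GROUP BY clause; DISTINCT only de-duplicates the final result rows and does not group anything for the aggregate.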
