Suspected alias issue in bigquery join - join

I am relatively new to bigquery and think I have an aliasing problem but can't work out what it is. Essentially, I have two tables and while the first table has the majority of the required information the second table has a date of birth that I need to join. I have written the below query and the two initial SELECT statements work in isolation and appear to return the expected values. However, when attempting to join the two tables I get an error stating:
Unrecognized name: t1_teams at [10:60]
WITH table_1 AS (SELECT competition_name, stat_season_name,
matchdata_Date, t1_teams.name, t1_players.Position, CAST(REGEXP_REPLACE(t1_players.uID, r'[a-zA-Z]', '') AS NUMERIC) AS Player_ID1, t1_players.First, t1_players.Last
FROM `prod.feed1`,
UNNEST(teams) AS t1_teams, UNNEST(t1_teams.Players) as t1_players),
table_2 AS (SELECT t2_players.uID AS Player_ID2, t2_players.stat_birth_date
FROM `prod.feed2`,
UNNEST(players) AS t2_players)
SELECT competition_name, stat_season_name, matchdata_Date, t1_teams.name, t1_players.Position, t1_players.uID, t1_players.First, t1_players.Last, t2_players.stat_birth_date
FROM table_1
LEFT JOIN table_2
ON Player_ID1 = Player_ID2
WHERE competition_name = "EPL"
AND stat_season_name = "Season 2018/2019"
Any help in steering me in the right direction would be greatly appreciated as my reading of the bigquery documentation and other searches have drawn a blank.

The problem is here:
WITH table_1 AS (
SELECT
competition_name,
stat_season_name,
matchdata_Date,
-- this line
t1_teams.name,
...
You're selecting t1_teams.name, so you end up with just name an an output column from the select list. If you want to refer to t1_teams later, then select that instead:
WITH table_1 AS (
SELECT
competition_name,
stat_season_name,
matchdata_Date,
-- this line
t1_teams,
...

Related

sql join clause is not returning expected results with arabic text

I really been working to solve this problem for quite a while, I have 2 tables with the same structure as follows:
registerationNumber int, CompanyName nvarchar, areaName nvarchar,phoneNumber int , email nvarchar, projectStatus nvarchar
All columns with data type nvarchar contains Arabic text except the email column,
tableA contains 675 rows and tableB contains 397 rows all exists in tableA
What I am trying to do is select the non matching rows from tableA,
they should be 675 - 398 = 277 rows
everytime I run the where clause I get all tables returned
The join clause I am writing is like this:
select a.registerationNumber
from tableA a left outer join tableB b
on a.registerationNumber = b.registerationNumber
but I am not getting any results, I tried all types of joins but I am getting the same results.
I created a sample database and inserted English data in the tables and it worked fine with the following clause:
select * from tblAllProjects a right join tblhalfProjects h
on a.registerationNumber = h.registerationNumber
which means that I am writing the correct the correct syntax,
I know that I should use the following syntax on selecting Arabic text:
Select * from tableA where comanyName like N'arabic_text'
Anyone knows what seems to be the problem ?
I meant to say that you should do:
select a.registerationNumber
from tableA a left outer join tableB b
on a.registerationNumber = b.registerationNumber
where b.registrationNumber is NULL
This should select all registrationNumbers from tableA, with no match in tableB.
another way to write this (but probably slower) is:
select a.registerationNumber
from tableA a
where a.registerationNumber not in (
select b.registerationNumber
from tableB )
You should/can use this second approach if there are only "a few" records returned from the sub-query.

Why does Hive warn that this subquery would cause a Cartesian product?

According to Hive's documentation it supports NOT IN subqueries in a WHERE clause, provided that the subquery is an uncorrelated subquery (does not reference columns from the main query).
However, when I attempt to run the trivial query below, I get an error FAILED: SemanticException Cartesian products are disabled for safety reasons.
-- sample data
CREATE TEMPORARY TABLE foods (name STRING);
CREATE TEMPORARY TABLE vegetables (name STRING);
INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
-- the problematic query
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables)
Note that if I use an IN clause instead of a NOT IN clause, it actually works fine, which is perplexing because the query evaluation structure should be the same in either case.
Is there a workaround for this, or another way to filter values from a query based on their presence in another table?
This is Hive 2.3.4 btw, running on an Amazon EMR cluster.
Not sure why you would get that error. One work around is to use not exists.
SELECT f.*
FROM foods f
WHERE NOT EXISTS (SELECT 1
FROM vegetables v
WHERE v.name = f.name)
or a left join
SELECT f.*
FROM foods f
LEFT JOIN vegetables v ON v.name = f.name
WHERE v.name is NULL
You got cartesian join because this is what Hive does in this case. vegetables table is very small (just one row) and it is being broadcasted to perform the cross (most probably map-join, check the plan) join. Hive does cross (map) join first and then applies filter. Explicit left join syntax with filter as #VamsiPrabhala said will force to perform left join, but in this case it works the same, because the table is very small and CROSS JOIN does not multiply rows.
Execute EXPLAIN on your query and you will see what is exactly happening.

Join / Aggregate Function Query

I have the following code and output:
SELECT CustomerCategoryName, COUNT(a.CustomerID) AS CustomersInThisCategory
FROM Sales.Customers AS a
RIGHT JOIN Sales.CustomerCategories AS b on a.CustomerCategoryID = b.CustomerCategoryID
GROUP BY CustomerCategoryName
ORDER BY CustomersInThisCategory DESC
This generates the following output:
When I add the following COUNT aggregate funcation and Inner Join:
SELECT CustomerCategoryName, COUNT(a.CustomerID) AS CustomersInThisCategory, COUNT(c.OrderID) AS Orders
FROM Sales.Customers AS a
RIGHT JOIN Sales.CustomerCategories AS b on a.CustomerCategoryID = b.CustomerCategoryID
INNER JOIN Sales.Orders AS c ON a.CustomerID = c.CustomerID
GROUP BY CustomerCategoryName
ORDER BY CustomersInThisCategory DESC
The output changes to:
I am not sure as to why the CustomersInThisCategory column is changing to the same as the Orders column? I'm also not sure why the results in the first ouput with 0 values are being removed in the second query as I still have the Right join present.
Any feedback would be much appreciated.
For your first query, count ( distinct a.customerId) should give you the unique customer ids in a category.
Regarding your second question, right join is performed before inner join. So the inner join will splice the records, for which a match is not found.
Hope my answer helps.

Ambiguous column error creating table in Aster Studio 6.0

I am new to databases and am posting a problem from work. I am creating a table in Aster Studio 6.0, but got an error about an ambiguous column. I ran the same query in Teradata SQL Assistant and did not get an error.
I have six tables with millions of rows named EDW.SWIFTIQ_TRANS_DTL, EDW.SWIFTIQ_STORE, EDW.SWIFTIQ_PROD, EDW.STORE_XREF, EDW.TDLNX_STR_OUTLT, and EDW.SURV_CWC.
EDW represents the original database, but the columns were labeled with aliases.
I did a trim() on the VARCHAR columns for saving spool space. For the error about TDLNX_RTL_OUTLT_NBR, I performed an INNER JOIN on similar columns from two different tables. Doing a preview in SQL Assistant, there was a temporary table with only one column called TDLNX_RTL_OUTLT_NBR.
Here’s the SQL query:
CREATE TABLE public.table_name
DISTRIBUTE BY HASH (SRC_SYS_PROD_ID) AS (
SELECT * FROM load_from_teradata(
ON public.load_from_teradata_dummy
TDPID(‘database_name')
USERNAME(’user_name')
PASSWORD(’ss')
QUERY ('SELECT e.TDLNX_RTL_OUTLT_NBR, e.OUTLT_ST_ADDR_TXT, e.STORE_OUTLT_ZIP_CD, d.TRANS_ID, d.TRANS_DT,
d.TRANS_TM, d.UNIT_QTY, d.SRC_SYS_STORE_ID, d.SRC_SYS_PROD_ID, d.SRC_SYS_NM, a.SRC_SYS_STORE_ID, a.SRC_SYS_NM, a.STORE_NM,
a.CITY_NM, a.ZIP_CD, a.ST_cd, p.SRC_SYS_PROD_ID, p.SRC_SYS_NM, p.UPC_CD, p.PROD_ID, f.SRC_SYS_STORE_ID, f.SRC_SYS_NM,
f.TDLNX_RTL_OUTLT_NBR, g.SURV_CWC_WSLR_CUST_PARTY_ID, g.AGE_CD, g.HIGH_END_ACCT_FLG, g.RACE_ETHNC_CD, g.OCCPN_CD
FROM EDW.SWIFTIQ_TRANS_DTL d
INNER JOIN EDW.SWIFTIQ_STORE a
ON trim( a.SRC_SYS_STORE_ID) = trim(d.SRC_SYS_STORE_ID)
INNER JOIN EDW.SWIFTIQ_PROD p
ON trim(p.SRC_SYS_PROD_ID) = trim(d.SRC_SYS_PROD_ID)
and p.SRC_SYS_NM = d.SRC_SYS_NM
INNER JOIN EDW.STORE_XREF f
ON trim(f.SRC_SYS_STORE_ID) = trim(a.SRC_SYS_STORE_ID)
INNER JOIN EDW.TDLNX_STR_OUTLT e
ON trim(e.TDLNX_RTL_OUTLT_NBR)= trim(f.TDLNX_RTL_OUTLT_NBR)
INNER JOIN EDW.SURV_CWC g
ON g.SURV_CWC_WSLR_CUST_PARTY_ID = e.WSLR_CUST_PARTY_ID
WHERE TRANS_DT between ''2015-01-01'' and ''2015-03-31''')
num_instances('4') ) );
ERROR: column reference 'TDLNX_RTL_OUTLT_NBR' is ambiguous.
EDIT: Forgot to include a description about the table aliases. a stands for EDW.SWIFTIQ_STORE, p for EDW.SWIFTIQ_PROD, f for EDW.STORE_XREF, e for EDW.TDLNX_STR_OUTLT, g for EDW.SURV_CWC, and d for EDW.SWIFTIQ_TRANS_DTL.
You will get the same error when you try CREATE TABLE AS SELECT in Teradata. There are three column names, SRC_SYS_NM & SRC_SYS_PROD_ID & SRC_SYS_STORE_ID, which are used multiple times (with different table aliases) within the SELECT.
Add column aliases to make those names unique, e.g. trans_SRC_SYS_NM instead of d.SRC_SYS_NM.
Additionally the TRIMs in the joins are a very bad idea. You will probably not save that much spool, but force the optimizer to redistribute all spools for join-preparation.

Get incremental changes between Hive partitions

I have a nightly job that runs and computes some data in hive. It is partitioned by day.
Fields:
id bigint
rank bigint
Yesterday
output/dt=2013-10-31
Today
output/dt=2013-11-01
I am trying to figure out if there is a easy way to get incremental changes between today and yesterday
I was thinking about doing a left outer join but not sure what that looks like since its the same table
This is what it might looks like when there are different tables
SELECT * FROM a LEFT OUTER JOIN b
ON (a.id=b.id AND a.dt='2013-11-01' and b.dt='2-13-10-31' ) WHERE a.rank!=B.rank
But on the same table it is
SELECT * FROM a LEFT OUTER JOIN a
ON (a.id=a.id AND a.dt='2013-11-01' and a.dt='2-13-10-31' ) WHERE a.rank!=a.rank
Suggestions?
This would work
SELECT a.*
FROM A a LEFT OUTER JOIN A b ON a.id = b.id
WHERE a.dt='2013-11-01' AND b.dt='2013-10-31' AND <your-rank-conditions>;
Efficiently, this would span 1 MapReduce job only.
So I figured it out... Using Subqueries and Joins
select * from (select * from table where dt='2013-11-01') a
FULL OUTER JOIN
(select * from table where dt='2013-10-31') b
on (a.id=b.id)
where a.rank!=b.rank or a.rank is null or b.rank is null
The above will give you the diff..
You can take the diff and figure out what you need to ADD/UPDATE/REMOVE
UPDATE If a.rank!=null and b.rank!=null i.e rank changed
DELETE IF a.rank=null and b.rank!=null i.e the user is no longer ranked
ADD if a.rank!=null and b.rank=null i.e this is a new user

Resources