Hive Join returning zero records - join

I have two Hive tables and I am trying to join both of them. The tables are not clustered or partitioned by any field. Though the tables contain records for common key fields, the join query always returns 0 records. All the data types are 'string' data types.
The join query is simple and looks something like below
select count(*) cnt
from
fsr.xref_1 A join
fsr.ipfile_1 B
on
(
A.co_no = B.co_no
)
;
Any idea what could be going wrong? I have just one record (same value) in both the tables.
Below are my table definitions
CREATE TABLE xref_1
(
co_no string
)
clustered by (co_no) sorted by (co_no asc) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
CREATE TABLE ipfile_1
(
co_no string
)
clustered by (co_no) sorted by (co_no asc) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Hi You are using Star Schema Join. Please use your query like this:
SELET COUNT(*) cnt FROM A a JOIN B b ON (a.key1 = b.key1);
If still have issue Then use MAPJOIN:
set hive.auto.convert.join=true;
select count(*) from A join B on (key1 = key2)
Please see Link for more detail.

Related

sql join clause is not returning expected results with arabic text

I really been working to solve this problem for quite a while, I have 2 tables with the same structure as follows:
registerationNumber int, CompanyName nvarchar, areaName nvarchar,phoneNumber int , email nvarchar, projectStatus nvarchar
All columns with data type nvarchar contains Arabic text except the email column,
tableA contains 675 rows and tableB contains 397 rows all exists in tableA
What I am trying to do is select the non matching rows from tableA,
they should be 675 - 398 = 277 rows
everytime I run the where clause I get all tables returned
The join clause I am writing is like this:
select a.registerationNumber
from tableA a left outer join tableB b
on a.registerationNumber = b.registerationNumber
but I am not getting any results, I tried all types of joins but I am getting the same results.
I created a sample database and inserted English data in the tables and it worked fine with the following clause:
select * from tblAllProjects a right join tblhalfProjects h
on a.registerationNumber = h.registerationNumber
which means that I am writing the correct the correct syntax,
I know that I should use the following syntax on selecting Arabic text:
Select * from tableA where comanyName like N'arabic_text'
Anyone knows what seems to be the problem ?
I meant to say that you should do:
select a.registerationNumber
from tableA a left outer join tableB b
on a.registerationNumber = b.registerationNumber
where b.registrationNumber is NULL
This should select all registrationNumbers from tableA, with no match in tableB.
another way to write this (but probably slower) is:
select a.registerationNumber
from tableA a
where a.registerationNumber not in (
select b.registerationNumber
from tableB )
You should/can use this second approach if there are only "a few" records returned from the sub-query.

Populating Fact Tables(Data Warehouse) and Querying

I am not sure how to query my fact tables(covid and vaccinations), I populated the dimensions with dummy data, I am supposed to leave the fact tables empty? As far as I know, they would get populated when I write the queries.
I am not sure how to query the tables I have tried different things, but I get an empty result.
Below is a link to the schema.
I want to find out the "TotalDeathsUK"(fact table COVID) for the last year caused by each "Strain"(my strain table has 3 strain in total.
You can use MERGE to poulate your fact table COVIDFact :
MERGE
INTO factcovid
using (
SELECT centerid,
dateid,
patientid,
strainid
FROM yourstagingfacttable ) AS f
ON factcovid.centerid = f.centerid AND factcovid.dateid=f.dateid... //the join columns
WHEN matched THEN
do nothing WHEN NOT matched THEN
INSERT VALUES
(
f.centerid,
f.dateid,
f.patientid,
f.strainid
)
And for VaccinationsFact :
MERGE
INTO vaccinations
using (
SELECT centerid,
dateid,
patientid,
vaccineid
FROM yourstagingfacttable ) AS f
ON factcovid.centerid = f.centerid //join condition(s)
WHEN matched THEN
do nothing WHEN NOT matched THEN
INSERT VALUES
(
f.centerid,
f.dateid,
f.patientid,
f.vaccineid
)
For the TotalDeathUK measure :
SELECT S.[Name] AS Strain, COUNT(CF.PatientID) AS [Count of Deaths] FROM CovidFact AS CF
LEFT JOIN Strain AS S ON S.StrainID=CF.StrainID
LEFT JOIN Time AS T ON CF.DateID=T.DateID
LEFT JOIN TreatmentCenter AS TR ON TR.CenterID=CF.CenterID
LEFT JOIN City AS C ON C.CityID = TR.CityID
WHERE C.Country LIKE 'UK' AND T.Year=2020
AND Result LIKE 'Death' // you should add a Result column to check if the Patient survived or died
GROUP BY S.[Name]

Left outer join with 3 tables and subquery

sorry for the late response.
For a key in table A, there may be 2 or more records present in tables B and C. That is, one another column in these tables will have a date value which would be making the keys unique. So I want to extract the record that has maximum date value. And that's why I am using the max function. I know that the subquery which I have coded should not be included in the ON clause and it would do the filtering before the join statement. So eventually I want to know how to mention the max clause in the query.
Example:
Table A
Key - AAAAA
Table B:
Record 1
Key - AAAAA
Date - 2017-10-01
Record 2
Key - AAAAA
Date - 2017-10-05
I want the only the record AAAAA/2017-10-05 to be selected from the table B
Basically records from table A where A.c3 = 'Y' should be extracted first (assume it gives 500 records)
Then join these 500 records with tables B and C (left outer, to have all the matching records and the non-matching records should have nulls in the columns from the tables B and C)
In tables B and C, if more than 1 record present with different dates, the maximum date field should be extracted.
Hence final output should contain 500 records.
This is all you need for what you describe
SELECT A.A1, A.A2, B.B1, B.B2, C.C1, C.C2
FROM TABLE1 A
LEFT OUTER JOIN TABLE2 B
ON A.A1 = B.B1
LEFT OUTER JOIN TABLE3 C
ON A.A1 = C.C1
WHERE A.C3 = ‘Y’
These lines are causing your problem...basically forcing your outer joins to an inner joins.
AND B.C3 = (SELECT MAX(B3) FROM TABLE2 T1
WHERE T1.B1 = B.B1)
AND C.C3 = (SELECT MAX(C3) FROM TABLE3 T1
WHERE T1.C1 = C.C1)
If there's no match in B or C , then B.C3 and/or C.C3 will be NULL and NULL can't be = to anything (or <> to anything for that matter)
What are you trying to accomplish with the above that you've not included in the question?
Just do it?
SELECT A.A1, A.A2, B.B1, B.B2, C.C1, C.C2
FROM TABLE1 A
LEFT OUTER JOIN TABLE2 B
ON A.A1 = B.B1
LEFT OUTER JOIN TABLE3 C
ON A.A1 = C.C1
WHERE A.C3 = 'Y' and (B.B1 is null or C.B1 is null)

Ambiguous column error creating table in Aster Studio 6.0

I am new to databases and am posting a problem from work. I am creating a table in Aster Studio 6.0, but got an error about an ambiguous column. I ran the same query in Teradata SQL Assistant and did not get an error.
I have six tables with millions of rows named EDW.SWIFTIQ_TRANS_DTL, EDW.SWIFTIQ_STORE, EDW.SWIFTIQ_PROD, EDW.STORE_XREF, EDW.TDLNX_STR_OUTLT, and EDW.SURV_CWC.
EDW represents the original database, but the columns were labeled with aliases.
I did a trim() on the VARCHAR columns for saving spool space. For the error about TDLNX_RTL_OUTLT_NBR, I performed an INNER JOIN on similar columns from two different tables. Doing a preview in SQL Assistant, there was a temporary table with only one column called TDLNX_RTL_OUTLT_NBR.
Here’s the SQL query:
CREATE TABLE public.table_name
DISTRIBUTE BY HASH (SRC_SYS_PROD_ID) AS (
SELECT * FROM load_from_teradata(
ON public.load_from_teradata_dummy
TDPID(‘database_name')
USERNAME(’user_name')
PASSWORD(’ss')
QUERY ('SELECT e.TDLNX_RTL_OUTLT_NBR, e.OUTLT_ST_ADDR_TXT, e.STORE_OUTLT_ZIP_CD, d.TRANS_ID, d.TRANS_DT,
d.TRANS_TM, d.UNIT_QTY, d.SRC_SYS_STORE_ID, d.SRC_SYS_PROD_ID, d.SRC_SYS_NM, a.SRC_SYS_STORE_ID, a.SRC_SYS_NM, a.STORE_NM,
a.CITY_NM, a.ZIP_CD, a.ST_cd, p.SRC_SYS_PROD_ID, p.SRC_SYS_NM, p.UPC_CD, p.PROD_ID, f.SRC_SYS_STORE_ID, f.SRC_SYS_NM,
f.TDLNX_RTL_OUTLT_NBR, g.SURV_CWC_WSLR_CUST_PARTY_ID, g.AGE_CD, g.HIGH_END_ACCT_FLG, g.RACE_ETHNC_CD, g.OCCPN_CD
FROM EDW.SWIFTIQ_TRANS_DTL d
INNER JOIN EDW.SWIFTIQ_STORE a
ON trim( a.SRC_SYS_STORE_ID) = trim(d.SRC_SYS_STORE_ID)
INNER JOIN EDW.SWIFTIQ_PROD p
ON trim(p.SRC_SYS_PROD_ID) = trim(d.SRC_SYS_PROD_ID)
and p.SRC_SYS_NM = d.SRC_SYS_NM
INNER JOIN EDW.STORE_XREF f
ON trim(f.SRC_SYS_STORE_ID) = trim(a.SRC_SYS_STORE_ID)
INNER JOIN EDW.TDLNX_STR_OUTLT e
ON trim(e.TDLNX_RTL_OUTLT_NBR)= trim(f.TDLNX_RTL_OUTLT_NBR)
INNER JOIN EDW.SURV_CWC g
ON g.SURV_CWC_WSLR_CUST_PARTY_ID = e.WSLR_CUST_PARTY_ID
WHERE TRANS_DT between ''2015-01-01'' and ''2015-03-31''')
num_instances('4') ) );
ERROR: column reference 'TDLNX_RTL_OUTLT_NBR' is ambiguous.
EDIT: Forgot to include a description about the table aliases. a stands for EDW.SWIFTIQ_STORE, p for EDW.SWIFTIQ_PROD, f for EDW.STORE_XREF, e for EDW.TDLNX_STR_OUTLT, g for EDW.SURV_CWC, and d for EDW.SWIFTIQ_TRANS_DTL.
You will get the same error when you try CREATE TABLE AS SELECT in Teradata. There are three column names, SRC_SYS_NM & SRC_SYS_PROD_ID & SRC_SYS_STORE_ID, which are used multiple times (with different table aliases) within the SELECT.
Add column aliases to make those names unique, e.g. trans_SRC_SYS_NM instead of d.SRC_SYS_NM.
Additionally the TRIMs in the joins are a very bad idea. You will probably not save that much spool, but force the optimizer to redistribute all spools for join-preparation.

Emulating an interval join in hive

I am using hive 0.13.
I have two tables:
data table. columns: id, time. 1E10 rows.
mymap table. columns: id, name, start_time, end_time. 1E6 rows.
For each row in the data table I want to get the name from the mymap table matching the id and the time interval. So I want to do a join like:
select data.id, time, name from data left outer join mymap on data.id = mymap.id and time>=start_time and time<end_time
It is known that for every row in data there are 0 or 1 matches in mymap.
The above query is not supported in hive as it is a non-equi-join. Moving the inequality conditions into a where filter does not work cause the join explodes before the filter is applied:
select data.id, time, name from data left outer join mymap on data.id = mymap.id where mymap.id is null or (time>=start_time and time<end_time)
(I am aware that the queries are not exactly equivalent due to cases where there is a match for id but no matching interval. This can be solved as I describe here: Hive: work around for non equi left join)
How can I go about this?
You could perform your join and then query from that table. I didn't test this code, but it would read something like
select id
,time
,name
from (
select d.id
,d.time
,m.name
,m.start_time
,m.end_time
from data as d LEFT OUTER JOIN mymap as m
ON d.id = m.id
) x
where time>=start_time
AND time<end_time
You could potentially get around this issue by flattening out the data structure in table2 and using a UDF to process the joined records.
select
id,
time,
nameFinderUDF(b.name_list, time) as name
from
data a
LEFT OUTER JOIN
(
select
id,
collect_set(array(name,cast(start_time as string),cast(end_time as string))) as name_list
from
mymap
group by
id
) b
ON (a.id=b.id)
With a UDF that does something like:
public String evaluate(ArrayList<ArrayList<String>> name_list,Long time) {
for (int i;i<name_list.length;i++) {
if (time >= Long.parseLong(name_list[i][1]) && time <= Long.parseLong(name_list[i][2])) {
return name_list[i][0]
return null;
}
This approach should make the merge 1 to 1, but it could create a fairly large data structure repeated many times. It is still quite a bit more efficient than a straight join.

Resources