Hive: Optimal way to JOIN using 2 ON conditions with NULL

Hive: Optimal way to JOIN using 2 ON conditions with NULL - join

I have 2 tables which look as below.
Table_A
ID1 ID2 NAME
112 NULL ADAM
132 990 BRIAN
NULL 980 CARL
Table_B
ID1 ID2 SURNAME
112 NULL LEVINE
132 990 LARA
NULL 980 JOHNSON
If I join the table as below the null comparisons would not work and hence not return a surname for ADAM
SELECT A.NAME,B.SURNAME
FROM
TABLE_A A
LEFT JOIN
TABLE_B B
ON A.ID1 = B.ID1
AND
A.ID2 = B.ID2;
I added a check for NULL in the ON clause for ID2 which did work but the operation turned out to be costly for even small tables. (See below)
SELECT A.NAME,B.SURNAME
FROM
TABLE_A A
LEFT JOIN
TABLE_B B
ON
(A.ID1 = B.ID1 OR (A.ID1 IS NULL AND B.ID1 IS NULL))
AND
(A.ID2 = B.ID2 OR (A.ID2 IS NULL AND B.ID2 IS NULL));
What would be the right way to go about with this comparison?

To join NULLs like normal values, use NVL() function to substitute NULL with some value which is not used normally in the data, for example -9999:
SELECT A.NAME,B.SURNAME
FROM
TABLE_A A
LEFT JOIN
TABLE_B B
ON NVL(A.ID1,-9999) = NVL(B.ID1,-9999)
AND
NVL(A.ID2,-9999) = NVL(B.ID2,-9999);

Hive doesn't support or expression in on condition.
The join condition should consists of purely equality expression.
I prefer the COALESCE function:
SELECT A.NAME,B.SURNAME
FROM
TABLE_A A
LEFT JOIN
TABLE_B B
ON
COALESCE(A.ID1, 'missing') = COALESCE(B.ID1, 'missing')
AND
COALESCE(A.ID2, 'missing') = COALESCE(B.ID2, 'missing')

This is a typical case scenario that calls for a NULL-safe equality operator, which is natively supported by Hive with the GenericUDF <=>. This operator, as I quote:
Returns same result with EQUAL(=) operator for non-null operands,
but returns TRUE if both are NULL, FALSE if one of the them is NULL.
So the SQL is as simple as the following:
select
a.name,
b.surname
from table_a a
left join table_b b
on a.id1 <=> b.id1 and a.id2 <=> b.id2;

Related

Snowflake - Joining two tables where one table's IDs are delimited by a semicolon

I am working within Snowflake and some of the IDs in a given table and column are delimited by semicolons. Despite this delimiter the tables should still be joined. Any attempt to join the table is usually met with an error of some sort.
Below I have an example of what I have attempted to do.
table_A table_B
+----------+----------+ +----------+----------+
| some_id | F_Name | | some_id | L_Name |
+----------+----------+ +----------+----------+
| 12345 | John | |12345;4321| Doe |
+----------+----------+ +----------+----------+
Attempted SQL Statement
select table_A.some_id, table_A.F_name, table_B.L_Name
from table_A
left join table_B on table_A.some_id like '%'||table_B.some_id||'%'
The source for this particular problem has shown up here but it doesn't seem to work. It may just not be possible to do joins in this particular way.
https://community.snowflake.com/s/question/0D50Z00008zPLLx/join-with-partial-string-match
Any help is most appreciated.

Your approach is correct but your use of LIKE clause isn't. You are looking for table_A.some_id as a partial part of table_B.some_id
So your query should be :
WITH table_A AS (
SELECT '12345' AS some_id, 'John' AS F_NAME
),
table_B AS (
SELECT '12345;4321' AS some_id, 'Doe' AS L_NAME
)
SELECT
table_A.some_id,
table_A.F_name,
L_Name
FROM table_A
LEFT JOIN table_B
ON table_B.some_id LIKE '%'||table_A.some_id||'%'
;

If you use a CTE to flatten the second table and then filter out duplicates
WITH expand AS (
SELECT
f.seq as seq,
f.index as index
f.value as join_token
t.l_name
FROM table_b t,
lateral split_to_table(t.some_id, ';') f
)
SELECT
a.some_id,
a.f_name,
f.l_name
FROM table_a AS a
LEFT JOIN expand AS f
ON a.some_id = f.join_token
QUALIFY row_number() over (partition by f.seq ORDER BY f.index) == 1

join / union in presto to keep email in one column

I'm trying to join two tables together in presto,
select o.email
, o.user_id
, c.email
, c.sessions
from datasource o
full join datasource2 c
on o.email = c.email
this yields:
email user_id email sessions
jeff#sessions.com 123 NULL NULL
mike#berkley.com 987 NULL NULL
jared#swiss.com 384 jared#swiss.com 14
steph#berk.com 333 NULL NULL
NULL NULL lisa#hart.com 12
the problem with this is that I want to do multiple joins on multiple data sources using email, the only workaround I can think of is to use this as a subquery, and create a new column that takes one, and if null, takes the other, then perform the full join on datasource3, rinse repeat.

You want to use COALESCE which will chose the not null of the two values.
COALESCE is very useful for a lot of things. It can take more than two values and will return the first non NULL value it gets. If all of them are NULL it will simply return NULL.
SELECT
COALLESCE(o.email, c.email) AS email
, o.user_id
, c.sessions
FROM datasource o
FULL JOIN datasource2 c
ON o.email = c.email
For the official documentation on COALESCE see here:
https://prestodb.io/docs/current/functions/conditional.html

Find records with ID in array of IDS and keep the order of records matching that of IDs [duplicate]

I have a simple SQL query in PostgreSQL 8.3 that grabs a bunch of comments. I provide a sorted list of values to the IN construct in the WHERE clause:
SELECT * FROM comments WHERE (comments.id IN (1,3,2,4));
This returns comments in an arbitrary order which in my happens to be ids like 1,2,3,4.
I want the resulting rows sorted like the list in the IN construct: (1,3,2,4).
How to achieve that?

You can do it quite easily with (introduced in PostgreSQL 8.2) VALUES (), ().
Syntax will be like this:
select c.*
from comments c
join (
values
(1,1),
(3,2),
(2,3),
(4,4)
) as x (id, ordering) on c.id = x.id
order by x.ordering

In Postgres 9.4 or later, this is simplest and fastest:
SELECT c.*
FROM comments c
JOIN unnest('{1,3,2,4}'::int[]) WITH ORDINALITY t(id, ord) USING (id)
ORDER BY t.ord;
WITH ORDINALITY was introduced with in Postgres 9.4.
No need for a subquery, we can use the set-returning function like a table directly. (A.k.a. "table-function".)
A string literal to hand in the array instead of an ARRAY constructor may be easier to implement with some clients.
For convenience (optionally), copy the column name we are joining to ("id" in the example), so we can join with a short USING clause to only get a single instance of the join column in the result.
Works with any input type. If your key column is of type text, provide something like '{foo,bar,baz}'::text[].
Detailed explanation:
PostgreSQL unnest() with element number

Just because it is so difficult to find and it has to be spread: in mySQL this can be done much simpler, but I don't know if it works in other SQL.
SELECT * FROM `comments`
WHERE `comments`.`id` IN ('12','5','3','17')
ORDER BY FIELD(`comments`.`id`,'12','5','3','17')

With Postgres 9.4 this can be done a bit shorter:
select c.*
from comments c
join (
select *
from unnest(array[43,47,42]) with ordinality
) as x (id, ordering) on c.id = x.id
order by x.ordering;
Or a bit more compact without a derived table:
select c.*
from comments c
join unnest(array[43,47,42]) with ordinality as x (id, ordering)
on c.id = x.id
order by x.ordering
Removing the need to manually assign/maintain a position to each value.
With Postgres 9.6 this can be done using array_position():
with x (id_list) as (
values (array[42,48,43])
)
select c.*
from comments c, x
where id = any (x.id_list)
order by array_position(x.id_list, c.id);
The CTE is used so that the list of values only needs to be specified once. If that is not important this can also be written as:
select c.*
from comments c
where id in (42,48,43)
order by array_position(array[42,48,43], c.id);

I think this way is better :
SELECT * FROM "comments" WHERE ("comments"."id" IN (1,3,2,4))
ORDER BY id=1 DESC, id=3 DESC, id=2 DESC, id=4 DESC

Another way to do it in Postgres would be to use the idx function.
SELECT *
FROM comments
ORDER BY idx(array[1,3,2,4], comments.id)
Don't forget to create the idx function first, as described here: http://wiki.postgresql.org/wiki/Array_Index

In Postgresql:
select *
from comments
where id in (1,3,2,4)
order by position(id::text in '1,3,2,4')

On researching this some more I found this solution:
SELECT * FROM "comments" WHERE ("comments"."id" IN (1,3,2,4))
ORDER BY CASE "comments"."id"
WHEN 1 THEN 1
WHEN 3 THEN 2
WHEN 2 THEN 3
WHEN 4 THEN 4
END
However this seems rather verbose and might have performance issues with large datasets.
Can anyone comment on these issues?

To do this, I think you should probably have an additional "ORDER" table which defines the mapping of IDs to order (effectively doing what your response to your own question said), which you can then use as an additional column on your select which you can then sort on.
In that way, you explicitly describe the ordering you desire in the database, where it should be.

sans SEQUENCE, works only on 8.4:
select * from comments c
join
(
select id, row_number() over() as id_sorter
from (select unnest(ARRAY[1,3,2,4]) as id) as y
) x on x.id = c.id
order by x.id_sorter

SELECT * FROM "comments" JOIN (
SELECT 1 as "id",1 as "order" UNION ALL
SELECT 3,2 UNION ALL SELECT 2,3 UNION ALL SELECT 4,4
) j ON "comments"."id" = j."id" ORDER BY j.ORDER
or if you prefer evil over good:
SELECT * FROM "comments" WHERE ("comments"."id" IN (1,3,2,4))
ORDER BY POSITION(','+"comments"."id"+',' IN ',1,3,2,4,')

And here's another solution that works and uses a constant table (http://www.postgresql.org/docs/8.3/interactive/sql-values.html):
SELECT * FROM comments AS c,
(VALUES (1,1),(3,2),(2,3),(4,4) ) AS t (ord_id,ord)
WHERE (c.id IN (1,3,2,4)) AND (c.id = t.ord_id)
ORDER BY ord
But again I'm not sure that this is performant.
I've got a bunch of answers now. Can I get some voting and comments so I know which is the winner!
Thanks All :-)

create sequence serial start 1;
select * from comments c
join (select unnest(ARRAY[1,3,2,4]) as id, nextval('serial') as id_sorter) x
on x.id = c.id
order by x.id_sorter;
drop sequence serial;
[EDIT]
unnest is not yet built-in in 8.3, but you can create one yourself(the beauty of any*):
create function unnest(anyarray) returns setof anyelement
language sql as
$$
select $1[i] from generate_series(array_lower($1,1),array_upper($1,1)) i;
$$;
that function can work in any type:
select unnest(array['John','Paul','George','Ringo']) as beatle
select unnest(array[1,3,2,4]) as id

Slight improvement over the version that uses a sequence I think:
CREATE OR REPLACE FUNCTION in_sort(anyarray, out id anyelement, out ordinal int)
LANGUAGE SQL AS
$$
SELECT $1[i], i FROM generate_series(array_lower($1,1),array_upper($1,1)) i;
$$;
SELECT
*
FROM
comments c
INNER JOIN (SELECT * FROM in_sort(ARRAY[1,3,2,4])) AS in_sort
USING (id)
ORDER BY in_sort.ordinal;

select * from comments where comments.id in
(select unnest(ids) from bbs where id=19795)
order by array_position((select ids from bbs where id=19795),comments.id)
here, [bbs] is the main table that has a field called ids,
and, ids is the array that store the comments.id .
passed in postgresql 9.6

Lets get a visual impression about what was already said. For example you have a table with some tasks:
SELECT a.id,a.status,a.description FROM minicloud_tasks as a ORDER BY random();
id | status | description
----+------------+------------------
4 | processing | work on postgres
6 | deleted | need some rest
3 | pending | garden party
5 | completed | work on html
And you want to order the list of tasks by its status.
The status is a list of string values:
(processing, pending, completed, deleted)
The trick is to give each status value an interger and order the list numerical:
SELECT a.id,a.status,a.description FROM minicloud_tasks AS a
JOIN (
VALUES ('processing', 1), ('pending', 2), ('completed', 3), ('deleted', 4)
) AS b (status, id) ON (a.status = b.status)
ORDER BY b.id ASC;
Which leads to:
id | status | description
----+------------+------------------
4 | processing | work on postgres
3 | pending | garden party
5 | completed | work on html
6 | deleted | need some rest
Credit #user80168

I agree with all other posters that say "don't do that" or "SQL isn't good at that". If you want to sort by some facet of comments then add another integer column to one of your tables to hold your sort criteria and sort by that value. eg "ORDER BY comments.sort DESC " If you want to sort these in a different order every time then... SQL won't be for you in this case.

Left outer join with 3 tables and subquery

sorry for the late response.
For a key in table A, there may be 2 or more records present in tables B and C. That is, one another column in these tables will have a date value which would be making the keys unique. So I want to extract the record that has maximum date value. And that's why I am using the max function. I know that the subquery which I have coded should not be included in the ON clause and it would do the filtering before the join statement. So eventually I want to know how to mention the max clause in the query.
Example:
Table A
Key - AAAAA
Table B:
Record 1
Key - AAAAA
Date - 2017-10-01
Record 2
Key - AAAAA
Date - 2017-10-05
I want the only the record AAAAA/2017-10-05 to be selected from the table B
Basically records from table A where A.c3 = 'Y' should be extracted first (assume it gives 500 records)
Then join these 500 records with tables B and C (left outer, to have all the matching records and the non-matching records should have nulls in the columns from the tables B and C)
In tables B and C, if more than 1 record present with different dates, the maximum date field should be extracted.
Hence final output should contain 500 records.

This is all you need for what you describe
SELECT A.A1, A.A2, B.B1, B.B2, C.C1, C.C2
FROM TABLE1 A
LEFT OUTER JOIN TABLE2 B
ON A.A1 = B.B1
LEFT OUTER JOIN TABLE3 C
ON A.A1 = C.C1
WHERE A.C3 = ‘Y’
These lines are causing your problem...basically forcing your outer joins to an inner joins.
AND B.C3 = (SELECT MAX(B3) FROM TABLE2 T1
WHERE T1.B1 = B.B1)
AND C.C3 = (SELECT MAX(C3) FROM TABLE3 T1
WHERE T1.C1 = C.C1)
If there's no match in B or C , then B.C3 and/or C.C3 will be NULL and NULL can't be = to anything (or <> to anything for that matter)
What are you trying to accomplish with the above that you've not included in the question?

Just do it?
SELECT A.A1, A.A2, B.B1, B.B2, C.C1, C.C2
FROM TABLE1 A
LEFT OUTER JOIN TABLE2 B
ON A.A1 = B.B1
LEFT OUTER JOIN TABLE3 C
ON A.A1 = C.C1
WHERE A.C3 = 'Y' and (B.B1 is null or C.B1 is null)

SQL lookup key defined by LAG function

I want to join two tables on a key based on LAG function. My query doesn't work though. I get an error:
Msg 4108, Level 15, State 1, Line 13 Windowed functions can only appear in the SELECT or ORDER BY clauses.
I shall appreciate any suggestion on how to tackle it.
**Table A**
Key
1
2
3
and so on...
**Table B**
MaxKey | Something
3 | A
5 | B
8 | C
**Expected Results**
Key|Something
1 A
2 A
3 A
4 B
5 B
6 C
SELECT
tabA.Key
,tabB.[Something]
,LAG (tabB.MaxKey,1,1) OVER (ORDER BY tabB.MaxKey) AS MinKey
,tabB.[MaxKey]
FROM TableA as tabA
LEFT JOIN TableB as tabB
ON tabA.Key > tabB.MinKey AND tabA.Key <= tabB.MaxKey

I think you can solve this using an outer apply like this:
select * from TableA a
outer apply (
select top 1 something
from TableB b
where b.maxkey >= a.[key]
) oa
Sample SQL Fiddle
Another option is to modify your query to do the lag in a derived table, I believe this might work too:
SELECT
tabA.[Key]
,tabB.[Something]
,MinKey
,tabB.[MaxKey]
FROM TableA as tabA
LEFT JOIN (
SELECT
[Something]
,LAG (MaxKey,1,1) OVER (ORDER BY MaxKey) AS MinKey
,[MaxKey]
FROM TableB) tabB
ON tabA.[key] >= tabB.MinKey AND tabA.[key] <= tabB.MaxKey
ORDER BY tabA.[key]

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Hive: Optimal way to JOIN using 2 ON conditions with NULL - join

To join NULLs like normal values, use NVL() function to substitute NULL with some value which is not used normally in the data, for example -9999: SELECT A.NAME,B.SURNAME FROM TABLE_A A LEFT JOIN TABLE_B B ON NVL(A.ID1,-9999) = NVL(B.ID1,-9999) AND NVL(A.ID2,-9999) = NVL(B.ID2,-9999);

Related

Snowflake - Joining two tables where one table's IDs are delimited by a semicolon

join / union in presto to keep email in one column

Find records with ID in array of IDS and keep the order of records matching that of IDs [duplicate]

Left outer join with 3 tables and subquery

SQL lookup key defined by LAG function

Categories

Resources