How to use joins and averages together in Hive queries - join

I have two tables in hive:
Table1: uid,txid,amt,vendor Table2: uid,txid
Now I need to join the tables on txid which basically confirms a transaction is finally recorded. There will be some transactions which will be present only in Table1 and not in Table2.
I need to find out number of avg of transaction matches found per user(uid) per vendor. Then I need to find the avg of these averages by adding all the averages and divide them by the number of unique users per vendor.
Let's say I have the data:
Table1:
u1,120,44,vend1
u1,199,33,vend1
u1,100,23,vend1
u1,101,24,vend1
u2,200,34,vend1
u2,202,32,vend2
Table2:
u1,100
u1,101
u2,200
u2,202
Example For vendor vend1:
u1-> Avg transaction find rate = 2(matches found in both Tables,Table1 and Table2)/4(total occurrence in Table1) =0.5
u2 -> Avg transaction find rate = 1/1 = 1
Avg of avgs = 0.5+1(sum of avgs)/2(total unique users) = 0.75
Required output:
vend1,0.75
vend2,1
I can't seem to find count of both matches and occurrence in just Table1 in one hive query per user per vendor. I have reached to this query and can't find how to change it further.
SELECT A.vendor,A.uid,count(*) as totalmatchesperuser FROM Table1 A JOIN Table2 B ON A.uid = B.uid AND B.txid =A.txid group by vendor,A.uid
Any help would be great.

I think you are running into trouble with your JOIN. When you JOIN by txid and uid, you are losing the total number of uid's per group. If I were you I would assign a column of 1's to table2 and name the column something like success or transaction and do a LEFT OUTER JOIN. Then in your new table you will have a column with the number 1 in it if there was a completed transaction and NULL otherwise. You can then do a case statement to convert these NULLs to 0
Query:
select vendor
,(SUM(avg_uid) / COUNT(uid)) as avg_of_avgs
from (
select vendor
,uid
,AVG(complete) as avg_uid
from (
select uid
,txid
,amt
,vendor
,case when success is null then 0
else success
end as complete
from (
select A.*
,B.success
from table1 as A
LEFT OUTER JOIN table2 as B
ON B.txid = A.txid
) x
) y
group by vendor, uid
) z
group by vendor
Output:
vend1 0.75
vend2 1.0
B.success in line 17 is the column of 1's that I put int table2 before the JOIN. If you are curious about case statements in Hive you can find them here

Amazing and precise answer by GoBrewers14!! Thank you so much. I was looking at it from a wrong perspective.
I made little changes in the query to get things finally done.
I didn't need to add a "success" colummn to table2. I checked B.txid in the above query instead of B.success. B.txid will be null in case a match is not found and be some value if a match is found. That checks the success & failure conditions itself without adding a new column. And then I set NULL as 0 and !NULL as 1 in the part above it. Also I changed some variable names as hive was finding it ambiguous.
The final query looks like :
select vendr
,(SUM(avg_uid) / COUNT(usrid)) as avg_of_avgs
from (
select vendr
,usrid
,AVG(complete) as avg_uid
from (
select usrid
,txnid
,amnt
,vendr
,case when success is null then 0
else 1
end as complete
from (
select A.uid as usrid,A.vendor as vendr,A.amt as amnt,A.txid as txnid
,B.txid as success
from Table1 as A
LEFT OUTER JOIN Table2 as B
ON B.txid = A.txid
) x
) y
group by vendr, usrid
) z
group by vendr;

Related

Snowflake: Joining a Table with Effective Dates and older records are showing NULL

Summary:
In Snowflake I have a table which records the maximum number of an item which changes every so often. I want to be able to join the max number of the item for that date (effective_date). This is the most basic "example" as in my table has items "expire" when they are removed.
CREATE OR REPLACE TABLE ITEM
(
Item VARCHAR(10),
Quantity Number(5,0),
EFFECTIVE_DATE DATE
)
;
CREATE OR REPLACE TABLE REPORT
(
INVOICE_DATE DATE,
ITEM VARCHAR(10)
)
;
INSERT INTO REPORT
VALUES
('2021-02-01', '100'),
('2021-09-10', '100')
;
INSERT INTO ITEM
VALUES
('100', '10', '2021-01-01'),
('101', '15', '2021-01-01'),
('100', '5', '2021-09-01')
;
SELECT * FROM REPORT t1
LEFT JOIN
(
SELECT * FROM ITEM
QUALIFY ROW_NUMBER() OVER (PARTITION BY ITEM ORDER BY EFFECTIVE_DATE desc) = 1
) t2 on t1.ITEM = t2.ITEM AND t1.INVOICE_DATE <= t2.EFFECTIVE_DATE
;
Returns
INVOICE_DATE,ITEM,ITEM,QUANTITY,EFFECTIVE_DATE
2021-02-01,100,100,5,2021-09-01
2021-09-10,100,NULL,NULL,NULL
How do I fix this so I no longer get NULL entries on my join.
Thank you for reading this!
I am hoping to get a result like this
INVOICE_DATE,ITEM,ITEM,QUANTITY,EFFECTIVE_DATE
2021-02-01,100,100,10,2021-01-01
2021-09-10,100,100,5,2021-09-01
The issue is with your data and your expectations. Your query is this:
SELECT * FROM REPORT t1
LEFT JOIN
(
SELECT * FROM ITEM
QUALIFY ROW_NUMBER() OVER (PARTITION BY ITEM ORDER BY EFFECTIVE_DATE desc) = 1
) t2 on t1.ITEM = t2.ITEM AND t1.INVOICE_DATE <= t2.EFFECTIVE_DATE
;
which requires that the INVOICE_DATE be less than or equal to the EFFECTIVE DATE of the ITEM. This isn't the case, though. 2021-09-10 is greater than 2021-09-01 so you don't get a join hit, which is why you get NULLs. It's also why your other record is returning the wrong information from your expectations.

Find records with ID in array of IDS and keep the order of records matching that of IDs [duplicate]

I have a simple SQL query in PostgreSQL 8.3 that grabs a bunch of comments. I provide a sorted list of values to the IN construct in the WHERE clause:
SELECT * FROM comments WHERE (comments.id IN (1,3,2,4));
This returns comments in an arbitrary order which in my happens to be ids like 1,2,3,4.
I want the resulting rows sorted like the list in the IN construct: (1,3,2,4).
How to achieve that?
You can do it quite easily with (introduced in PostgreSQL 8.2) VALUES (), ().
Syntax will be like this:
select c.*
from comments c
join (
values
(1,1),
(3,2),
(2,3),
(4,4)
) as x (id, ordering) on c.id = x.id
order by x.ordering
In Postgres 9.4 or later, this is simplest and fastest:
SELECT c.*
FROM comments c
JOIN unnest('{1,3,2,4}'::int[]) WITH ORDINALITY t(id, ord) USING (id)
ORDER BY t.ord;
WITH ORDINALITY was introduced with in Postgres 9.4.
No need for a subquery, we can use the set-returning function like a table directly. (A.k.a. "table-function".)
A string literal to hand in the array instead of an ARRAY constructor may be easier to implement with some clients.
For convenience (optionally), copy the column name we are joining to ("id" in the example), so we can join with a short USING clause to only get a single instance of the join column in the result.
Works with any input type. If your key column is of type text, provide something like '{foo,bar,baz}'::text[].
Detailed explanation:
PostgreSQL unnest() with element number
Just because it is so difficult to find and it has to be spread: in mySQL this can be done much simpler, but I don't know if it works in other SQL.
SELECT * FROM `comments`
WHERE `comments`.`id` IN ('12','5','3','17')
ORDER BY FIELD(`comments`.`id`,'12','5','3','17')
With Postgres 9.4 this can be done a bit shorter:
select c.*
from comments c
join (
select *
from unnest(array[43,47,42]) with ordinality
) as x (id, ordering) on c.id = x.id
order by x.ordering;
Or a bit more compact without a derived table:
select c.*
from comments c
join unnest(array[43,47,42]) with ordinality as x (id, ordering)
on c.id = x.id
order by x.ordering
Removing the need to manually assign/maintain a position to each value.
With Postgres 9.6 this can be done using array_position():
with x (id_list) as (
values (array[42,48,43])
)
select c.*
from comments c, x
where id = any (x.id_list)
order by array_position(x.id_list, c.id);
The CTE is used so that the list of values only needs to be specified once. If that is not important this can also be written as:
select c.*
from comments c
where id in (42,48,43)
order by array_position(array[42,48,43], c.id);
I think this way is better :
SELECT * FROM "comments" WHERE ("comments"."id" IN (1,3,2,4))
ORDER BY id=1 DESC, id=3 DESC, id=2 DESC, id=4 DESC
Another way to do it in Postgres would be to use the idx function.
SELECT *
FROM comments
ORDER BY idx(array[1,3,2,4], comments.id)
Don't forget to create the idx function first, as described here: http://wiki.postgresql.org/wiki/Array_Index
In Postgresql:
select *
from comments
where id in (1,3,2,4)
order by position(id::text in '1,3,2,4')
On researching this some more I found this solution:
SELECT * FROM "comments" WHERE ("comments"."id" IN (1,3,2,4))
ORDER BY CASE "comments"."id"
WHEN 1 THEN 1
WHEN 3 THEN 2
WHEN 2 THEN 3
WHEN 4 THEN 4
END
However this seems rather verbose and might have performance issues with large datasets.
Can anyone comment on these issues?
To do this, I think you should probably have an additional "ORDER" table which defines the mapping of IDs to order (effectively doing what your response to your own question said), which you can then use as an additional column on your select which you can then sort on.
In that way, you explicitly describe the ordering you desire in the database, where it should be.
sans SEQUENCE, works only on 8.4:
select * from comments c
join
(
select id, row_number() over() as id_sorter
from (select unnest(ARRAY[1,3,2,4]) as id) as y
) x on x.id = c.id
order by x.id_sorter
SELECT * FROM "comments" JOIN (
SELECT 1 as "id",1 as "order" UNION ALL
SELECT 3,2 UNION ALL SELECT 2,3 UNION ALL SELECT 4,4
) j ON "comments"."id" = j."id" ORDER BY j.ORDER
or if you prefer evil over good:
SELECT * FROM "comments" WHERE ("comments"."id" IN (1,3,2,4))
ORDER BY POSITION(','+"comments"."id"+',' IN ',1,3,2,4,')
And here's another solution that works and uses a constant table (http://www.postgresql.org/docs/8.3/interactive/sql-values.html):
SELECT * FROM comments AS c,
(VALUES (1,1),(3,2),(2,3),(4,4) ) AS t (ord_id,ord)
WHERE (c.id IN (1,3,2,4)) AND (c.id = t.ord_id)
ORDER BY ord
But again I'm not sure that this is performant.
I've got a bunch of answers now. Can I get some voting and comments so I know which is the winner!
Thanks All :-)
create sequence serial start 1;
select * from comments c
join (select unnest(ARRAY[1,3,2,4]) as id, nextval('serial') as id_sorter) x
on x.id = c.id
order by x.id_sorter;
drop sequence serial;
[EDIT]
unnest is not yet built-in in 8.3, but you can create one yourself(the beauty of any*):
create function unnest(anyarray) returns setof anyelement
language sql as
$$
select $1[i] from generate_series(array_lower($1,1),array_upper($1,1)) i;
$$;
that function can work in any type:
select unnest(array['John','Paul','George','Ringo']) as beatle
select unnest(array[1,3,2,4]) as id
Slight improvement over the version that uses a sequence I think:
CREATE OR REPLACE FUNCTION in_sort(anyarray, out id anyelement, out ordinal int)
LANGUAGE SQL AS
$$
SELECT $1[i], i FROM generate_series(array_lower($1,1),array_upper($1,1)) i;
$$;
SELECT
*
FROM
comments c
INNER JOIN (SELECT * FROM in_sort(ARRAY[1,3,2,4])) AS in_sort
USING (id)
ORDER BY in_sort.ordinal;
select * from comments where comments.id in
(select unnest(ids) from bbs where id=19795)
order by array_position((select ids from bbs where id=19795),comments.id)
here, [bbs] is the main table that has a field called ids,
and, ids is the array that store the comments.id .
passed in postgresql 9.6
Lets get a visual impression about what was already said. For example you have a table with some tasks:
SELECT a.id,a.status,a.description FROM minicloud_tasks as a ORDER BY random();
id | status | description
----+------------+------------------
4 | processing | work on postgres
6 | deleted | need some rest
3 | pending | garden party
5 | completed | work on html
And you want to order the list of tasks by its status.
The status is a list of string values:
(processing, pending, completed, deleted)
The trick is to give each status value an interger and order the list numerical:
SELECT a.id,a.status,a.description FROM minicloud_tasks AS a
JOIN (
VALUES ('processing', 1), ('pending', 2), ('completed', 3), ('deleted', 4)
) AS b (status, id) ON (a.status = b.status)
ORDER BY b.id ASC;
Which leads to:
id | status | description
----+------------+------------------
4 | processing | work on postgres
3 | pending | garden party
5 | completed | work on html
6 | deleted | need some rest
Credit #user80168
I agree with all other posters that say "don't do that" or "SQL isn't good at that". If you want to sort by some facet of comments then add another integer column to one of your tables to hold your sort criteria and sort by that value. eg "ORDER BY comments.sort DESC " If you want to sort these in a different order every time then... SQL won't be for you in this case.

PSQL - Select size of tables for both partitioned and normal

Thanks in advance for any help with this, it is highly appreciated.
So, basically, I have a Greenplum database and I am wanting to select the table size for the top 10 largest tables. This isn't a problem using the below:
select
sotaidschemaname schema_name
,sotaidtablename table_name
,pg_size_pretty(sotaidtablesize) table_size
from gp_toolkit.gp_size_of_table_and_indexes_disk
order by 3 desc
limit 10
;
However I have several partitioned tables in my database and these show up with the above sql as all their 'child tables' split up into small fragments (though I know they accumalate to make the largest 2 tables). Is there a way of making a script that selects tables (partitioned or otherwise) and their total size?
Note: I'd be happy to include some sort of join where I specify the partitoned table-name specifically as there are only 2 partitioned tables. However, I would still need to take the top 10 (where I cannot assume the partitioned table(s) are up there) and I cannot specify any other table names since there are near a thousand of them.
Thanks again,
Vinny.
Your friends would be pg_relation_size() function for getting relation size and you would select pg_class, pg_namespace and pg_partition joining them together like this:
select schemaname,
tablename,
sum(size_mb) as size_mb,
sum(num_partitions) as num_partitions
from (
select coalesce(p.schemaname, n.nspname) as schemaname,
coalesce(p.tablename, c.relname) as tablename,
1 as num_partitions,
pg_relation_size(n.nspname || '.' || c.relname)/1000000. as size_mb
from pg_class as c
inner join pg_namespace as n on c.relnamespace = n.oid
left join pg_partitions as p on c.relname = p.partitiontablename and n.nspname = p.partitionschemaname
) as q
group by 1, 2
order by 3 desc
limit 10;
select * from
(
select schemaname,tablename,
pg_relation_size(schemaname||'.'||tablename) as Size_In_Bytes
from pg_tables
where schemaname||'.'||tablename not in (select schemaname||'.'||partitiontablename from pg_partitions)
and schemaname||'.'||tablename not in (select distinct schemaname||'.'||tablename from pg_partitions )
union all
select schemaname,tablename,
sum(pg_relation_size(schemaname||'.'||partitiontablename)) as Size_In_Bytes
from pg_partitions
group by 1,2) as foo
where Size_In_Bytes >= '0' order by 3 desc;

Emulating an interval join in hive

I am using hive 0.13.
I have two tables:
data table. columns: id, time. 1E10 rows.
mymap table. columns: id, name, start_time, end_time. 1E6 rows.
For each row in the data table I want to get the name from the mymap table matching the id and the time interval. So I want to do a join like:
select data.id, time, name from data left outer join mymap on data.id = mymap.id and time>=start_time and time<end_time
It is known that for every row in data there are 0 or 1 matches in mymap.
The above query is not supported in hive as it is a non-equi-join. Moving the inequality conditions into a where filter does not work cause the join explodes before the filter is applied:
select data.id, time, name from data left outer join mymap on data.id = mymap.id where mymap.id is null or (time>=start_time and time<end_time)
(I am aware that the queries are not exactly equivalent due to cases where there is a match for id but no matching interval. This can be solved as I describe here: Hive: work around for non equi left join)
How can I go about this?
You could perform your join and then query from that table. I didn't test this code, but it would read something like
select id
,time
,name
from (
select d.id
,d.time
,m.name
,m.start_time
,m.end_time
from data as d LEFT OUTER JOIN mymap as m
ON d.id = m.id
) x
where time>=start_time
AND time<end_time
You could potentially get around this issue by flattening out the data structure in table2 and using a UDF to process the joined records.
select
id,
time,
nameFinderUDF(b.name_list, time) as name
from
data a
LEFT OUTER JOIN
(
select
id,
collect_set(array(name,cast(start_time as string),cast(end_time as string))) as name_list
from
mymap
group by
id
) b
ON (a.id=b.id)
With a UDF that does something like:
public String evaluate(ArrayList<ArrayList<String>> name_list,Long time) {
for (int i;i<name_list.length;i++) {
if (time >= Long.parseLong(name_list[i][1]) && time <= Long.parseLong(name_list[i][2])) {
return name_list[i][0]
return null;
}
This approach should make the merge 1 to 1, but it could create a fairly large data structure repeated many times. It is still quite a bit more efficient than a straight join.

Multiplying factors received through a SELECT JOIN command

I use a MYSQL command as follows:
UPDATE TABLE 1 FROM TABLE1 JOIN TABLE2 USING (COLUMN1)
SET TABLE1.AMOUNT = TABLE1.AMOUNT * TABLE2.FACTOR
According to this JOIN, there should be 3 rows returned from TABLE2 (say with factos 2, 3 and 4) but the TABLE1.AMOUNT only multiply the FACTOR in the first row and not the 2nd and 3rd row.
I expect to get the original AMOUNT x (2x3x4) BUT I get the value AMOUNT x 2
How do I solve this? Thanks for your help.
An UPDATE statement only updates a given row once. You need to replace TABLE2 with a subquery that produces the right multiplier. Unfortunately, MySQL doesn't have any multiplicative counterpart to SUM for multiplying a group of values together, but if you can accept some extra roundoff error, I suppose you could write:
UPDATE table1
FROM table1
JOIN ( SELECT column1,
EXP(SUM(LN(table2.factor))) AS total_factor
FROM table2
GROUP
BY column1
) subquery2
USING (column1)
SET table1.amount = table1.amount * subquery2.total_factor
;
(using the fact that Πak = eΣlnak).

Resources