Get incremental changes between Hive partitions

Get incremental changes between Hive partitions - join

I have a nightly job that runs and computes some data in hive. It is partitioned by day.
Fields:
id bigint
rank bigint
Yesterday
output/dt=2013-10-31
Today
output/dt=2013-11-01
I am trying to figure out if there is a easy way to get incremental changes between today and yesterday
I was thinking about doing a left outer join but not sure what that looks like since its the same table
This is what it might looks like when there are different tables
SELECT * FROM a LEFT OUTER JOIN b
ON (a.id=b.id AND a.dt='2013-11-01' and b.dt='2-13-10-31' ) WHERE a.rank!=B.rank
But on the same table it is
SELECT * FROM a LEFT OUTER JOIN a
ON (a.id=a.id AND a.dt='2013-11-01' and a.dt='2-13-10-31' ) WHERE a.rank!=a.rank
Suggestions?

This would work
SELECT a.*
FROM A a LEFT OUTER JOIN A b ON a.id = b.id
WHERE a.dt='2013-11-01' AND b.dt='2013-10-31' AND <your-rank-conditions>;
Efficiently, this would span 1 MapReduce job only.

So I figured it out... Using Subqueries and Joins
select * from (select * from table where dt='2013-11-01') a
FULL OUTER JOIN
(select * from table where dt='2013-10-31') b
on (a.id=b.id)
where a.rank!=b.rank or a.rank is null or b.rank is null
The above will give you the diff..
You can take the diff and figure out what you need to ADD/UPDATE/REMOVE
UPDATE If a.rank!=null and b.rank!=null i.e rank changed
DELETE IF a.rank=null and b.rank!=null i.e the user is no longer ranked
ADD if a.rank!=null and b.rank=null i.e this is a new user

Related

SQL Server Query Performance - Normal Join vs Subquery

I have two queries that return the same data.
Query1, which is normal join takes a long time to execute:
SELECT TOP 1000 bigtable.*, tbl1.name, tb2.name FROM
bigtable INNER JOIN tbl1 on bigtable.id1 = tbl1.id1 AND
INNER JOIN tbl2 on tbl1.id1 = tbl2.id1
order by bigtable.id desc
Query2 that uses a sub-query returns fairly quickly:
SELECT subtable.*, tbl1.name, tb2.name FROM
(SELECT TOP 1000 FROM bigtable) subtable
INNER JOIN tbl1 on subtable.id1 = tbl1.id1 AND
INNER JOIN tbl2 on tbl1.id1 = tbl2.id1
order by subtable.id desc
bigtable contains 100k rows or so. tbl1 is a very small table (less than 10 rows). I would rather not use subqueries. If I skip the order by clause, both queries run quickly. I have tried adding indexes to the fields being joined, adding a DESC index on id etc. but nothing seems to help.
Any help is appreciated!
===> Update:
This turned out to be an non-issue. After creating another table similar to tbl1 with the same rows, I found that the Query1 ran under a second (with the copied table). Rebuilt stats on tbl1 and it fixed it.

I think that the two queries are not equivalent - try to write the second one as
SELECT subtable.*, tbl1.name, tb2.name FROM
(SELECT TOP 1000 FROM bigtable order by bigtable.id desc) subtable
INNER JOIN tbl1 on subtable.id1 = tbl1.id1 AND
INNER JOIN tbl2 on tbl1.id1 = tbl2.id1
order by subtable.id desc
I expect the expensive operation to be the ordering of the big table, which is now present in both versions.

Redshift - Efficient JOIN clause with OR

I have the need to join a huge table (10 million plus rows) to a lookup table (15k plus rows) with an OR condition. Something like:
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2 ON t1.c = t2.c OR t1.d = t2.d;
This is because table1 can have c or d as NULL, and I'd like to join on whichever is available, leaving out the rest. The query plan says there is a Nested Loop, which I realize is because of the OR condition. Is there a clean, efficient way of solving this problem? I'm using Redshift.
EDIT: I am trying to run this with a UNION, but it doesn't seem to be any faster than before.

If you have a preferred column you can NVL() (aka COALESCE()) them and join on that.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2
ON t1.c = NVL(t2.c,t2.d);
I'd also suggest that you should set the lookup table to DISTSTYLE ALL to ensure that the larger table is not redistributed.
[ Also, 10 million rows isn't big for Redshift. Not trying to be snotty just saying that we get excellent performance on Redshift even when querying (and joining) tables with hundreds of billions of rows. ]

How about doing two (left) joins? With the small lookup table performance shouldn't be too bad even.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t3.d)
FROM table1 t1
LEFT JOIN table2 t2 ON t1.d = t2.d and t1.c is null
LEFT JOIN table2 t3 ON t1.c = t3.c and t1.d is null
Your original query only returns rows that match at least one of c or d in the lookup table. If that's not guaranteed you may need to add filters...for example rows in t1 where both c and d are null or have values not present in table2.
Don't really need the null checks in the joins, but might be slightly faster.

Hive: Not in subquery join

I'm looking for a way to select all values from one table which do no exits in other table. This needs to be done on two variables, not one.
select * from tb1
where tb1.id1 not in (select id1 from tb2)
and tb1.id2 not in (select id2 from tb2)
I cannot use subquery. It needs to be done using joins only.
I tried this:
select * from tb1 full join tb2 on
tb1.id1=tb2.id1 and tb1.id2=tb2.id2
This works fine with one variable in condition, but not two.
Please suggest some resolution.

Since you are looking to get all the data from tb1 with no common data on columns id1 and id2 on tb2, You can use a left outer join on table tb1. Something like
SELECT tb1.* FROM tb1 LEFT OUTER JOIN tb2 ON
(tb1.id1=tb2.id1 AND tb1.id2=tb2.id2)
WHERE tb2.id1 IS NULL

PSQL - Select size of tables for both partitioned and normal

Thanks in advance for any help with this, it is highly appreciated.
So, basically, I have a Greenplum database and I am wanting to select the table size for the top 10 largest tables. This isn't a problem using the below:
select
sotaidschemaname schema_name
,sotaidtablename table_name
,pg_size_pretty(sotaidtablesize) table_size
from gp_toolkit.gp_size_of_table_and_indexes_disk
order by 3 desc
limit 10
;
However I have several partitioned tables in my database and these show up with the above sql as all their 'child tables' split up into small fragments (though I know they accumalate to make the largest 2 tables). Is there a way of making a script that selects tables (partitioned or otherwise) and their total size?
Note: I'd be happy to include some sort of join where I specify the partitoned table-name specifically as there are only 2 partitioned tables. However, I would still need to take the top 10 (where I cannot assume the partitioned table(s) are up there) and I cannot specify any other table names since there are near a thousand of them.
Thanks again,
Vinny.

Your friends would be pg_relation_size() function for getting relation size and you would select pg_class, pg_namespace and pg_partition joining them together like this:
select schemaname,
tablename,
sum(size_mb) as size_mb,
sum(num_partitions) as num_partitions
from (
select coalesce(p.schemaname, n.nspname) as schemaname,
coalesce(p.tablename, c.relname) as tablename,
1 as num_partitions,
pg_relation_size(n.nspname || '.' || c.relname)/1000000. as size_mb
from pg_class as c
inner join pg_namespace as n on c.relnamespace = n.oid
left join pg_partitions as p on c.relname = p.partitiontablename and n.nspname = p.partitionschemaname
) as q
group by 1, 2
order by 3 desc
limit 10;

select * from
(
select schemaname,tablename,
pg_relation_size(schemaname||'.'||tablename) as Size_In_Bytes
from pg_tables
where schemaname||'.'||tablename not in (select schemaname||'.'||partitiontablename from pg_partitions)
and schemaname||'.'||tablename not in (select distinct schemaname||'.'||tablename from pg_partitions )
union all
select schemaname,tablename,
sum(pg_relation_size(schemaname||'.'||partitiontablename)) as Size_In_Bytes
from pg_partitions
group by 1,2) as foo
where Size_In_Bytes >= '0' order by 3 desc;

select multiple columns from different tables and join in hive

I have a hive table A with 5 columns, the first column(A.key) is the key and I want to keep all 5 columns. I want to select 2 columns from B, say B.key1 and B.key2 and 2 columns from C, say C.key1 and C.key2. I want to join these columns with A.key = B.key1 and B.key2 = C.key1
What I want is a new external table D that has the following columns. B.key2 and C.key2 values should be given NULL if no matching happened.
A.key, A_col1, A_col2, A_col3, A_col4, B.key2, C.key2
What should be the correct hive query command? I got a max split error for my initial try.

Does this work?
create external table D as
select A.key, A.col1, A.col2, A.col3, A.col4, B.key2, C.key2
from A left outer join B on A.key = B.key1 left outer join C on A.key = C.key2;
If not, could you post more info about the "max split error" you mentioned? Copy+paste specific error message text is good.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Get incremental changes between Hive partitions - join

This would work SELECT a.* FROM A a LEFT OUTER JOIN A b ON a.id = b.id WHERE a.dt='2013-11-01' AND b.dt='2013-10-31' AND <your-rank-conditions>; Efficiently, this would span 1 MapReduce job only.

Related

SQL Server Query Performance - Normal Join vs Subquery

Redshift - Efficient JOIN clause with OR

Hive: Not in subquery join

PSQL - Select size of tables for both partitioned and normal

select multiple columns from different tables and join in hive

Categories

Resources