Summary:
In Snowflake I have a table which records the maximum number of an item which changes every so often. I want to be able to join the max number of the item for that date (effective_date). This is the most basic "example" as in my table has items "expire" when they are removed.
CREATE OR REPLACE TABLE ITEM
(
Item VARCHAR(10),
Quantity Number(5,0),
EFFECTIVE_DATE DATE
)
;
CREATE OR REPLACE TABLE REPORT
(
INVOICE_DATE DATE,
ITEM VARCHAR(10)
)
;
INSERT INTO REPORT
VALUES
('2021-02-01', '100'),
('2021-09-10', '100')
;
INSERT INTO ITEM
VALUES
('100', '10', '2021-01-01'),
('101', '15', '2021-01-01'),
('100', '5', '2021-09-01')
;
SELECT * FROM REPORT t1
LEFT JOIN
(
SELECT * FROM ITEM
QUALIFY ROW_NUMBER() OVER (PARTITION BY ITEM ORDER BY EFFECTIVE_DATE desc) = 1
) t2 on t1.ITEM = t2.ITEM AND t1.INVOICE_DATE <= t2.EFFECTIVE_DATE
;
Returns
INVOICE_DATE,ITEM,ITEM,QUANTITY,EFFECTIVE_DATE
2021-02-01,100,100,5,2021-09-01
2021-09-10,100,NULL,NULL,NULL
How do I fix this so I no longer get NULL entries on my join.
Thank you for reading this!
I am hoping to get a result like this
INVOICE_DATE,ITEM,ITEM,QUANTITY,EFFECTIVE_DATE
2021-02-01,100,100,10,2021-01-01
2021-09-10,100,100,5,2021-09-01
The issue is with your data and your expectations. Your query is this:
SELECT * FROM REPORT t1
LEFT JOIN
(
SELECT * FROM ITEM
QUALIFY ROW_NUMBER() OVER (PARTITION BY ITEM ORDER BY EFFECTIVE_DATE desc) = 1
) t2 on t1.ITEM = t2.ITEM AND t1.INVOICE_DATE <= t2.EFFECTIVE_DATE
;
which requires that the INVOICE_DATE be less than or equal to the EFFECTIVE DATE of the ITEM. This isn't the case, though. 2021-09-10 is greater than 2021-09-01 so you don't get a join hit, which is why you get NULLs. It's also why your other record is returning the wrong information from your expectations.
I have a simple SQL query in PostgreSQL 8.3 that grabs a bunch of comments. I provide a sorted list of values to the IN construct in the WHERE clause:
SELECT * FROM comments WHERE (comments.id IN (1,3,2,4));
This returns comments in an arbitrary order which in my happens to be ids like 1,2,3,4.
I want the resulting rows sorted like the list in the IN construct: (1,3,2,4).
How to achieve that?
You can do it quite easily with (introduced in PostgreSQL 8.2) VALUES (), ().
Syntax will be like this:
select c.*
from comments c
join (
values
(1,1),
(3,2),
(2,3),
(4,4)
) as x (id, ordering) on c.id = x.id
order by x.ordering
In Postgres 9.4 or later, this is simplest and fastest:
SELECT c.*
FROM comments c
JOIN unnest('{1,3,2,4}'::int[]) WITH ORDINALITY t(id, ord) USING (id)
ORDER BY t.ord;
WITH ORDINALITY was introduced with in Postgres 9.4.
No need for a subquery, we can use the set-returning function like a table directly. (A.k.a. "table-function".)
A string literal to hand in the array instead of an ARRAY constructor may be easier to implement with some clients.
For convenience (optionally), copy the column name we are joining to ("id" in the example), so we can join with a short USING clause to only get a single instance of the join column in the result.
Works with any input type. If your key column is of type text, provide something like '{foo,bar,baz}'::text[].
Detailed explanation:
PostgreSQL unnest() with element number
Just because it is so difficult to find and it has to be spread: in mySQL this can be done much simpler, but I don't know if it works in other SQL.
SELECT * FROM `comments`
WHERE `comments`.`id` IN ('12','5','3','17')
ORDER BY FIELD(`comments`.`id`,'12','5','3','17')
With Postgres 9.4 this can be done a bit shorter:
select c.*
from comments c
join (
select *
from unnest(array[43,47,42]) with ordinality
) as x (id, ordering) on c.id = x.id
order by x.ordering;
Or a bit more compact without a derived table:
select c.*
from comments c
join unnest(array[43,47,42]) with ordinality as x (id, ordering)
on c.id = x.id
order by x.ordering
Removing the need to manually assign/maintain a position to each value.
With Postgres 9.6 this can be done using array_position():
with x (id_list) as (
values (array[42,48,43])
)
select c.*
from comments c, x
where id = any (x.id_list)
order by array_position(x.id_list, c.id);
The CTE is used so that the list of values only needs to be specified once. If that is not important this can also be written as:
select c.*
from comments c
where id in (42,48,43)
order by array_position(array[42,48,43], c.id);
I think this way is better :
SELECT * FROM "comments" WHERE ("comments"."id" IN (1,3,2,4))
ORDER BY id=1 DESC, id=3 DESC, id=2 DESC, id=4 DESC
Another way to do it in Postgres would be to use the idx function.
SELECT *
FROM comments
ORDER BY idx(array[1,3,2,4], comments.id)
Don't forget to create the idx function first, as described here: http://wiki.postgresql.org/wiki/Array_Index
In Postgresql:
select *
from comments
where id in (1,3,2,4)
order by position(id::text in '1,3,2,4')
On researching this some more I found this solution:
SELECT * FROM "comments" WHERE ("comments"."id" IN (1,3,2,4))
ORDER BY CASE "comments"."id"
WHEN 1 THEN 1
WHEN 3 THEN 2
WHEN 2 THEN 3
WHEN 4 THEN 4
END
However this seems rather verbose and might have performance issues with large datasets.
Can anyone comment on these issues?
To do this, I think you should probably have an additional "ORDER" table which defines the mapping of IDs to order (effectively doing what your response to your own question said), which you can then use as an additional column on your select which you can then sort on.
In that way, you explicitly describe the ordering you desire in the database, where it should be.
sans SEQUENCE, works only on 8.4:
select * from comments c
join
(
select id, row_number() over() as id_sorter
from (select unnest(ARRAY[1,3,2,4]) as id) as y
) x on x.id = c.id
order by x.id_sorter
SELECT * FROM "comments" JOIN (
SELECT 1 as "id",1 as "order" UNION ALL
SELECT 3,2 UNION ALL SELECT 2,3 UNION ALL SELECT 4,4
) j ON "comments"."id" = j."id" ORDER BY j.ORDER
or if you prefer evil over good:
SELECT * FROM "comments" WHERE ("comments"."id" IN (1,3,2,4))
ORDER BY POSITION(','+"comments"."id"+',' IN ',1,3,2,4,')
And here's another solution that works and uses a constant table (http://www.postgresql.org/docs/8.3/interactive/sql-values.html):
SELECT * FROM comments AS c,
(VALUES (1,1),(3,2),(2,3),(4,4) ) AS t (ord_id,ord)
WHERE (c.id IN (1,3,2,4)) AND (c.id = t.ord_id)
ORDER BY ord
But again I'm not sure that this is performant.
I've got a bunch of answers now. Can I get some voting and comments so I know which is the winner!
Thanks All :-)
create sequence serial start 1;
select * from comments c
join (select unnest(ARRAY[1,3,2,4]) as id, nextval('serial') as id_sorter) x
on x.id = c.id
order by x.id_sorter;
drop sequence serial;
[EDIT]
unnest is not yet built-in in 8.3, but you can create one yourself(the beauty of any*):
create function unnest(anyarray) returns setof anyelement
language sql as
$$
select $1[i] from generate_series(array_lower($1,1),array_upper($1,1)) i;
$$;
that function can work in any type:
select unnest(array['John','Paul','George','Ringo']) as beatle
select unnest(array[1,3,2,4]) as id
Slight improvement over the version that uses a sequence I think:
CREATE OR REPLACE FUNCTION in_sort(anyarray, out id anyelement, out ordinal int)
LANGUAGE SQL AS
$$
SELECT $1[i], i FROM generate_series(array_lower($1,1),array_upper($1,1)) i;
$$;
SELECT
*
FROM
comments c
INNER JOIN (SELECT * FROM in_sort(ARRAY[1,3,2,4])) AS in_sort
USING (id)
ORDER BY in_sort.ordinal;
select * from comments where comments.id in
(select unnest(ids) from bbs where id=19795)
order by array_position((select ids from bbs where id=19795),comments.id)
here, [bbs] is the main table that has a field called ids,
and, ids is the array that store the comments.id .
passed in postgresql 9.6
Lets get a visual impression about what was already said. For example you have a table with some tasks:
SELECT a.id,a.status,a.description FROM minicloud_tasks as a ORDER BY random();
id | status | description
----+------------+------------------
4 | processing | work on postgres
6 | deleted | need some rest
3 | pending | garden party
5 | completed | work on html
And you want to order the list of tasks by its status.
The status is a list of string values:
(processing, pending, completed, deleted)
The trick is to give each status value an interger and order the list numerical:
SELECT a.id,a.status,a.description FROM minicloud_tasks AS a
JOIN (
VALUES ('processing', 1), ('pending', 2), ('completed', 3), ('deleted', 4)
) AS b (status, id) ON (a.status = b.status)
ORDER BY b.id ASC;
Which leads to:
id | status | description
----+------------+------------------
4 | processing | work on postgres
3 | pending | garden party
5 | completed | work on html
6 | deleted | need some rest
Credit #user80168
I agree with all other posters that say "don't do that" or "SQL isn't good at that". If you want to sort by some facet of comments then add another integer column to one of your tables to hold your sort criteria and sort by that value. eg "ORDER BY comments.sort DESC " If you want to sort these in a different order every time then... SQL won't be for you in this case.
Thanks in advance for any help with this, it is highly appreciated.
So, basically, I have a Greenplum database and I am wanting to select the table size for the top 10 largest tables. This isn't a problem using the below:
select
sotaidschemaname schema_name
,sotaidtablename table_name
,pg_size_pretty(sotaidtablesize) table_size
from gp_toolkit.gp_size_of_table_and_indexes_disk
order by 3 desc
limit 10
;
However I have several partitioned tables in my database and these show up with the above sql as all their 'child tables' split up into small fragments (though I know they accumalate to make the largest 2 tables). Is there a way of making a script that selects tables (partitioned or otherwise) and their total size?
Note: I'd be happy to include some sort of join where I specify the partitoned table-name specifically as there are only 2 partitioned tables. However, I would still need to take the top 10 (where I cannot assume the partitioned table(s) are up there) and I cannot specify any other table names since there are near a thousand of them.
Thanks again,
Vinny.
Your friends would be pg_relation_size() function for getting relation size and you would select pg_class, pg_namespace and pg_partition joining them together like this:
select schemaname,
tablename,
sum(size_mb) as size_mb,
sum(num_partitions) as num_partitions
from (
select coalesce(p.schemaname, n.nspname) as schemaname,
coalesce(p.tablename, c.relname) as tablename,
1 as num_partitions,
pg_relation_size(n.nspname || '.' || c.relname)/1000000. as size_mb
from pg_class as c
inner join pg_namespace as n on c.relnamespace = n.oid
left join pg_partitions as p on c.relname = p.partitiontablename and n.nspname = p.partitionschemaname
) as q
group by 1, 2
order by 3 desc
limit 10;
select * from
(
select schemaname,tablename,
pg_relation_size(schemaname||'.'||tablename) as Size_In_Bytes
from pg_tables
where schemaname||'.'||tablename not in (select schemaname||'.'||partitiontablename from pg_partitions)
and schemaname||'.'||tablename not in (select distinct schemaname||'.'||tablename from pg_partitions )
union all
select schemaname,tablename,
sum(pg_relation_size(schemaname||'.'||partitiontablename)) as Size_In_Bytes
from pg_partitions
group by 1,2) as foo
where Size_In_Bytes >= '0' order by 3 desc;
I am using hive 0.13.
I have two tables:
data table. columns: id, time. 1E10 rows.
mymap table. columns: id, name, start_time, end_time. 1E6 rows.
For each row in the data table I want to get the name from the mymap table matching the id and the time interval. So I want to do a join like:
select data.id, time, name from data left outer join mymap on data.id = mymap.id and time>=start_time and time<end_time
It is known that for every row in data there are 0 or 1 matches in mymap.
The above query is not supported in hive as it is a non-equi-join. Moving the inequality conditions into a where filter does not work cause the join explodes before the filter is applied:
select data.id, time, name from data left outer join mymap on data.id = mymap.id where mymap.id is null or (time>=start_time and time<end_time)
(I am aware that the queries are not exactly equivalent due to cases where there is a match for id but no matching interval. This can be solved as I describe here: Hive: work around for non equi left join)
How can I go about this?
You could perform your join and then query from that table. I didn't test this code, but it would read something like
select id
,time
,name
from (
select d.id
,d.time
,m.name
,m.start_time
,m.end_time
from data as d LEFT OUTER JOIN mymap as m
ON d.id = m.id
) x
where time>=start_time
AND time<end_time
You could potentially get around this issue by flattening out the data structure in table2 and using a UDF to process the joined records.
select
id,
time,
nameFinderUDF(b.name_list, time) as name
from
data a
LEFT OUTER JOIN
(
select
id,
collect_set(array(name,cast(start_time as string),cast(end_time as string))) as name_list
from
mymap
group by
id
) b
ON (a.id=b.id)
With a UDF that does something like:
public String evaluate(ArrayList<ArrayList<String>> name_list,Long time) {
for (int i;i<name_list.length;i++) {
if (time >= Long.parseLong(name_list[i][1]) && time <= Long.parseLong(name_list[i][2])) {
return name_list[i][0]
return null;
}
This approach should make the merge 1 to 1, but it could create a fairly large data structure repeated many times. It is still quite a bit more efficient than a straight join.