My current workflow is as follows. However, it is extremely slow: it can only handle millions of rows per day, and I want to speed it up. Any ideas?
query_expression (generated by a Ruby rake file; the generated SQL expression is then passed to ActiveRecord::Base.connection.execute)
Step 1 of sample-1-query.sql: aggregate the data by minute, hour, ...
LEFT JOIN
(
SELECT DISTINCT ON (1)
date_trunc('#{frequence}', ticktime) AS ticktime ,
Step 2 of sample-1-query.sql: fill the empty periods/gaps
FROM
(
SELECT DISTINCT ON (1) generate_series
(
date_trunc('second', min(ticktime)::TIMESTAMP),
max(ticktime)::TIMESTAMP,
The full query_expression:
SELECT DISTINCT ON (time)
time_series.ticktime AS time,
t.high,
t.low,
t.open,
t.close,
t.volume,
t.product_type,
t.contract_month
FROM
(
SELECT DISTINCT ON (1) generate_series
(
date_trunc('second', min(ticktime)::TIMESTAMP),
max(ticktime)::TIMESTAMP,
'1 #{frequence}'::interval
) AS ticktime FROM #{market} WHERE product_type ='#{product_type}' AND contract_month = '#{contract_month}'::timestamp
) time_series
LEFT JOIN
(
SELECT DISTINCT ON (1)
date_trunc('#{frequence}', ticktime) AS ticktime ,
first_value(last_price) OVER w AS open,
max(last_price) OVER w AS high ,
min(last_price) OVER w AS low,
last_value(last_price) OVER w AS close,
sum(last_volume) OVER w AS volume,
product_type,
contract_month
FROM #{market}
WHERE product_type ='#{product_type}'
AND contract_month = '#{contract_month}'::timestamp
WINDOW w AS (PARTITION BY date_trunc('#{frequence}', ticktime) ORDER BY ticktime
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
) t USING (ticktime)
WHERE time_series.ticktime::time >= '#{market_begin_end_time[market]["begin_at"]}'::time
AND time_series.ticktime::time < '#{market_begin_end_time[market]["end_at"]}'::time
AND time_series.ticktime > '#{sampling_begin_time}'::TIMESTAMP
ORDER BY 1
Then, in the Rake file:
ActiveRecord::Base.connection.execute(query_expression).each_with_index do |raw_record, j|
Model.create(raw_record)
end
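A commonly cited bottleneck in a loop like this is that Model.create issues a separate INSERT statement (plus validations and a round trip) for every row. A minimal sketch of a set-based alternative, assuming the generated query_expression has been wrapped in a view and a matching target table exists (sampled_view and sampled_bars are hypothetical names):
-- Hypothetical names: sampled_view wraps the query_expression above,
-- and sampled_bars has the same columns as the query's SELECT list.
INSERT INTO sampled_bars
SELECT * FROM sampled_view;
This keeps the whole transfer inside PostgreSQL instead of instantiating one ActiveRecord model per row.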
Summary:
In Snowflake I have a table which records the maximum number of an item; this number changes every so often. I want to be able to join the max number of the item for a given date (effective_date). This is the most basic example, as in my real table items "expire" when they are removed.
CREATE OR REPLACE TABLE ITEM
(
Item VARCHAR(10),
Quantity Number(5,0),
EFFECTIVE_DATE DATE
)
;
CREATE OR REPLACE TABLE REPORT
(
INVOICE_DATE DATE,
ITEM VARCHAR(10)
)
;
INSERT INTO REPORT
VALUES
('2021-02-01', '100'),
('2021-09-10', '100')
;
INSERT INTO ITEM
VALUES
('100', '10', '2021-01-01'),
('101', '15', '2021-01-01'),
('100', '5', '2021-09-01')
;
SELECT * FROM REPORT t1
LEFT JOIN
(
SELECT * FROM ITEM
QUALIFY ROW_NUMBER() OVER (PARTITION BY ITEM ORDER BY EFFECTIVE_DATE desc) = 1
) t2 on t1.ITEM = t2.ITEM AND t1.INVOICE_DATE <= t2.EFFECTIVE_DATE
;
Returns
INVOICE_DATE,ITEM,ITEM,QUANTITY,EFFECTIVE_DATE
2021-02-01,100,100,5,2021-09-01
2021-09-10,100,NULL,NULL,NULL
How do I fix this so I no longer get NULL entries in my join?
Thank you for reading this!
I am hoping to get a result like this:
INVOICE_DATE,ITEM,ITEM,QUANTITY,EFFECTIVE_DATE
2021-02-01,100,100,10,2021-01-01
2021-09-10,100,100,5,2021-09-01
The issue is with your data and your expectations. Your query is this:
SELECT * FROM REPORT t1
LEFT JOIN
(
SELECT * FROM ITEM
QUALIFY ROW_NUMBER() OVER (PARTITION BY ITEM ORDER BY EFFECTIVE_DATE desc) = 1
) t2 on t1.ITEM = t2.ITEM AND t1.INVOICE_DATE <= t2.EFFECTIVE_DATE
;
which requires that the INVOICE_DATE be less than or equal to the EFFECTIVE_DATE of the ITEM. That isn't the case here, though: 2021-09-10 is greater than 2021-09-01, so you don't get a join hit, which is why you get NULLs. It is also why your other record returns different information from what you expected.
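A minimal sketch of one way to get the desired as-of behavior, assuming you want, for each report row, the latest ITEM row whose EFFECTIVE_DATE is on or before the INVOICE_DATE (note the reversed date comparison, and the QUALIFY moved to the outer query):
SELECT r.INVOICE_DATE, r.ITEM, i.ITEM, i.QUANTITY, i.EFFECTIVE_DATE
FROM REPORT r
LEFT JOIN ITEM i
  ON r.ITEM = i.ITEM
  AND i.EFFECTIVE_DATE <= r.INVOICE_DATE
QUALIFY ROW_NUMBER() OVER (PARTITION BY r.INVOICE_DATE, r.ITEM
                           ORDER BY i.EFFECTIVE_DATE DESC) = 1
;
With the sample data above, this returns quantity 10 for 2021-02-01 and quantity 5 for 2021-09-10, matching the expected output.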
I am playing around with ClickHouse and I am trying to figure out why the query below gives me DB::Exception: Memory limit (for query) exceeded. I could use some help...
SELECT * FROM
(
SELECT created_at, rates.car_id, MIN(rates.price) FROM rates
WHERE
pickup_location_id = 198
AND created_at = '2020-10-01'
GROUP BY created_at, car_id
) r
JOIN cars c2 ON r.car_id = c2.id
The inner query performs almost instantly (over millions of records) and yields only 212 results. However, adding the JOIN causes the query to fail with the memory exception (45 GB).
It looks like the JOIN happens on the whole of rates/cars, and not on the "result"?
ClickHouse uses a hash join and places the right-hand table into an in-memory hash table.
In the case of an inner join, you can swap the tables:
SELECT * FROM cars c2 JOIN
(
SELECT created_at, rates.car_id, MIN(rates.price) FROM rates
WHERE
pickup_location_id = 198
AND created_at = '2020-10-01'
GROUP BY created_at, car_id
) r
ON r.car_id = c2.id
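If even the pre-aggregated right-hand side is too large for memory, ClickHouse can also be told to use a join algorithm that spills to disk instead of the default in-memory hash join (a sketch; availability depends on your ClickHouse version):
-- Assumption: a ClickHouse version that supports partial_merge joins.
SET join_algorithm = 'partial_merge';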
I have the following data:
INSERT table1,adwords_id=123-456-7890 total_spending=0 1538377201000000000
INSERT table1,adwords_id=123-456-7890 total_spending=110 1538463601000000000
INSERT table1,adwords_id=123-456-7890 total_spending=120 1538550001000000000
And I want to write a query to find the difference of total_spending between two timestamps.
For example, let's say I want to find the difference of total_spending between 1538377201000000000 + 1h and 1538550001000000000 + 1h.
Doing it line by line would be:
v1 = SELECT last(total_spending) FROM table1 WHERE "adwords_id" = '123-456-7890' AND time < 1538377201000000000 + 1h
v2 = SELECT last(total_spending) FROM table1 WHERE "adwords_id" = '123-456-7890' AND time < 1538550001000000000 + 1h
And the answer will be v2-v1
How can I do this in one query? (so I can run this across many adwords_id)
The below should work for you.
You need to supply values for the variables #t1, #t2 and #adwordsId.
SELECT
(
SELECT TOP 1 total_spending
FROM Table1
WHERE adwords_id = #adwordsId
AND time < #t2
ORDER BY time DESC
) -
(
SELECT TOP 1 total_spending
FROM Table1
WHERE adwords_id = #adwordsId
AND time < #t1
ORDER BY time DESC
)
Assuming total_spending is cumulative for each adwords_id, I would try this:
SELECT last("total_spending") - first("total_spending") FROM Table1 WHERE time >= t1 AND time <= t2 GROUP BY "adwords_id"
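If that cumulative assumption holds, plugging in the timestamps from the question would look something like this (mirroring the question's time arithmetic; the GROUP BY runs it across every adwords_id at once):
SELECT last("total_spending") - first("total_spending")
FROM table1
WHERE time >= 1538377201000000000 + 1h AND time <= 1538550001000000000 + 1h
GROUP BY "adwords_id"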
Thanks in advance for any help with this, it is highly appreciated.
So, basically, I have a Greenplum database and I want to select the table size for the 10 largest tables. This isn't a problem using the below:
select
sotaidschemaname schema_name
,sotaidtablename table_name
,pg_size_pretty(sotaidtablesize) table_size
from gp_toolkit.gp_size_of_table_and_indexes_disk
order by sotaidtablesize desc
limit 10
;
However, I have several partitioned tables in my database, and with the above SQL these show up as all of their 'child tables' split into small fragments (though I know they accumulate to make up the two largest tables). Is there a way of writing a script that selects tables (partitioned or otherwise) and their total size?
Note: I'd be happy to include some sort of join where I specify the partitioned table name explicitly, as there are only 2 partitioned tables. However, I would still need to take the top 10 (where I cannot assume the partitioned table(s) are up there), and I cannot specify every other table name, since there are nearly a thousand of them.
Thanks again,
Vinny.
Your friend here would be the pg_relation_size() function for getting a relation's size; you would query pg_class, pg_namespace and pg_partitions, joining them together like this:
select schemaname,
tablename,
sum(size_mb) as size_mb,
sum(num_partitions) as num_partitions
from (
select coalesce(p.schemaname, n.nspname) as schemaname,
coalesce(p.tablename, c.relname) as tablename,
1 as num_partitions,
pg_relation_size(n.nspname || '.' || c.relname)/1000000. as size_mb
from pg_class as c
inner join pg_namespace as n on c.relnamespace = n.oid
left join pg_partitions as p on c.relname = p.partitiontablename and n.nspname = p.partitionschemaname
) as q
group by 1, 2
order by 3 desc
limit 10;
select * from
(
select schemaname,tablename,
pg_relation_size(schemaname||'.'||tablename) as Size_In_Bytes
from pg_tables
where schemaname||'.'||tablename not in (select schemaname||'.'||partitiontablename from pg_partitions)
and schemaname||'.'||tablename not in (select distinct schemaname||'.'||tablename from pg_partitions )
union all
select schemaname,tablename,
sum(pg_relation_size(schemaname||'.'||partitiontablename)) as Size_In_Bytes
from pg_partitions
group by 1, 2
) as foo
where Size_In_Bytes >= 0
order by 3 desc;
I have two tables in hive:
Table1: uid, txid, amt, vendor
Table2: uid, txid
Now I need to join the tables on txid, which basically confirms that a transaction was finally recorded. There will be some transactions which are present only in Table1 and not in Table2.
I need to find the average transaction match rate per user (uid) per vendor. Then I need the average of these averages: add up all the per-user averages and divide by the number of unique users per vendor.
Let's say I have the data:
Table1:
u1,120,44,vend1
u1,199,33,vend1
u1,100,23,vend1
u1,101,24,vend1
u2,200,34,vend1
u2,202,32,vend2
Table2:
u1,100
u1,101
u2,200
u2,202
Example for vendor vend1:
u1 -> avg transaction find rate = 2 (matches found in both Table1 and Table2) / 4 (total occurrences in Table1) = 0.5
u2 -> avg transaction find rate = 1/1 = 1
Avg of avgs = (0.5 + 1) (sum of avgs) / 2 (total unique users) = 0.75
Required output:
vend1,0.75
vend2,1
I can't seem to get both the count of matches and the count of total occurrences in Table1, per user per vendor, in one Hive query. I have got as far as this query and can't figure out how to change it further:
SELECT A.vendor, A.uid, count(*) AS totalmatchesperuser FROM Table1 A JOIN Table2 B ON A.uid = B.uid AND B.txid = A.txid GROUP BY A.vendor, A.uid
Any help would be great.
I think you are running into trouble with your JOIN. When you join by txid and uid, you lose the total number of uids per group. If I were you, I would add a column of 1's to table2 (named something like success or transaction) and do a LEFT OUTER JOIN. In the joined table you will then have a column containing 1 where there was a completed transaction and NULL otherwise. You can then use a CASE statement to convert those NULLs to 0.
Query:
select vendor
,(SUM(avg_uid) / COUNT(uid)) as avg_of_avgs
from (
select vendor
,uid
,AVG(complete) as avg_uid
from (
select uid
,txid
,amt
,vendor
,case when success is null then 0
else success
end as complete
from (
select A.*
,B.success
from table1 as A
LEFT OUTER JOIN table2 as B
ON B.txid = A.txid
) x
) y
group by vendor, uid
) z
group by vendor
Output:
vend1 0.75
vend2 1.0
B.success, in the innermost subquery above, is the column of 1's that I put into table2 before the JOIN. If you are curious about CASE statements in Hive, you can find them in the Hive language manual.
Amazing and precise answer by GoBrewers14!! Thank you so much. I was looking at it from the wrong perspective.
I made a few small changes to the query to get things finally done.
I didn't need to add a "success" column to table2. I checked B.txid in the above query instead of B.success: B.txid will be NULL when a match is not found and non-NULL when a match is found, so it captures the success and failure conditions without adding a new column. Then, in the CASE above it, I map NULL to 0 and non-NULL to 1. I also renamed some columns, since Hive was finding the names ambiguous.
The final query looks like :
select vendr
,(SUM(avg_uid) / COUNT(usrid)) as avg_of_avgs
from (
select vendr
,usrid
,AVG(complete) as avg_uid
from (
select usrid
,txnid
,amnt
,vendr
,case when success is null then 0
else 1
end as complete
from (
select A.uid as usrid,A.vendor as vendr,A.amt as amnt,A.txid as txnid
,B.txid as success
from Table1 as A
LEFT OUTER JOIN Table2 as B
ON B.txid = A.txid
) x
) y
group by vendr, usrid
) z
group by vendr;