Multiple SELECTs take ages with Snowflake join

I have a table with 6M rows, and it seems my query takes ages.
I'm trying to calculate values for the two previous rolling months.
Input:

Period      ID   Tag  Name   Program    Total Cost
2017-06-01  ID1  X    User1  Program 1  438
2020-12-01  ID2  A    User2  Program 2  118
2020-12-01  ID3  X    User3  Program 3  380
Wanted output:

Period      ID   Tag  Name   Program    Total Cost  Period M-1  Total Cost M-1  Period M-2  Total Cost M-2
2017-06-01  ID1  X    User1  Program 1  438         2017-05-01  372             2017-04-01  340
2020-12-01  ID2  A    User2  Program 2  118         2020-11-01  103             2020-10-01  98
2020-12-01  ID3  X    User3  Program 3  380         2020-11-01  362             2020-10-01  334
Where am I wrong? The query below is very slow.
WITH month_M AS (
    SELECT "Period","ID","Tag","Name","Program","Cost USD",
           DATEADD(MONTH, -1, "Period") AS "Period M-1",
           DATEADD(MONTH, -2, "Period") AS "Period M-2"
    FROM "ARROWSPHERE_PROD_DB"."PBI_SCH"."Revenue_Dashboard"
), month_M1 AS (
    SELECT "Period","ID","Tag","Name","Program","Cost USD"
    FROM "ARROWSPHERE_PROD_DB"."PBI_SCH"."Revenue_Dashboard"
), month_M2 AS (
    SELECT "Period","ID","Tag","Name","Program","Cost USD"
    FROM "ARROWSPHERE_PROD_DB"."PBI_SCH"."Revenue_Dashboard"
)
SELECT M."Period", M."ID", M."Tag", M."Name", M."Program", M."Cost USD",
       M."Period M-1", M1."Cost USD" AS "Total Cost M-1",
       M."Period M-2", M2."Cost USD" AS "Total Cost M-2"
FROM month_M AS M, month_M1 AS M1, month_M2 AS M2
WHERE M."Period M-1" = M1."Period" AND M."Period M-2" = M2."Period"
  AND M."ID" = M1."ID" AND M."ID" = M2."ID"
  AND M."Tag" = M1."Tag" AND M."Tag" = M2."Tag"
  AND M."Name" = M1."Name" AND M."Name" = M2."Name"
  AND M."Program" = M1."Program" AND M."Program" = M2."Program"

You can achieve your goal with a window function like LAG, drastically reducing both your SQL's complexity and the execution plan that performs the operation, which I'd guess will require only a single table scan (https://docs.snowflake.com/en/sql-reference/functions/lag.html).
CREATE OR REPLACE TEMPORARY TABLE TMP_TEST (
Period TIMESTAMP,
ID VARCHAR,
Tag VARCHAR,
Name VARCHAR,
Program VARCHAR,
TotalCost NUMERIC
);
INSERT INTO TMP_TEST
VALUES
('2020-10-01', 'ID2', 'A', 'User2', 'Program 2', 98),
('2020-11-01', 'ID2', 'A', 'User2', 'Program 2', 103),
('2020-12-01', 'ID2', 'A', 'User2', 'Program 2', 118),
('2020-10-01', 'ID3', 'X', 'User3', 'Program 3', 334),
('2020-11-01', 'ID3', 'X', 'User3', 'Program 3', 362),
('2020-12-01', 'ID3', 'X', 'User3', 'Program 3', 380);
SELECT *,
       DATEADD(MONTH, -1, Period) AS "Period M-1",
       -- LAG(TotalCost, 1) reaches one month back within each group
       LAG(TotalCost, 1, 0) OVER (PARTITION BY Id, Tag, Name, Program ORDER BY Period) AS "TotalCost M-1",
       DATEADD(MONTH, -2, Period) AS "Period M-2",
       -- LAG(TotalCost, 2) reaches two months back
       LAG(TotalCost, 2, 0) OVER (PARTITION BY Id, Tag, Name, Program ORDER BY Period) AS "TotalCost M-2"
FROM TMP_TEST
ORDER BY Id, Tag, Name, Period;

This is valid SQL, so it's not "wrong", but since there are no predicates Snowflake must do a full table scan of the 6M-record table, do the processing, and return about as many rows... which is a lot of work to do.
If you can't just temporarily use a bigger warehouse, then you will have to dig into the Query Profile to find the bottleneck by clicking the query_id and then the "Profile" tab from the Worksheet UI.
First look at the Profile Overview and look at the breakdown of Remote IO to Processing.
You can reduce Remote IO by selecting fewer columns (if possible) or by using a predicate (one year at a time, or users whose names start with X, or something... you may have to experiment). You can click on a step to see how much it was able to prune.
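For example, here is a minimal sketch of the predicate approach, assuming the same table and columns as the question (the one-year range is illustrative):
-- The range predicate on "Period" lets Snowflake prune micro-partitions,
-- cutting Remote IO; process one year per run and combine the results.
SELECT "Period", "ID", "Tag", "Name", "Program", "Cost USD"
FROM "ARROWSPHERE_PROD_DB"."PBI_SCH"."Revenue_Dashboard"
WHERE "Period" >= '2020-01-01' AND "Period" < '2021-01-01';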
You can reduce processing by doing less :) which won't be easy, but you could try a left join (example below) or a window query (sketched after it).
WITH rev_dash as (select $1 "Period", $2 "ID", $3 "Tag", $4 "Name", $5 "Program", $6 "Cost USD" from values
('2017-06-01', 'ID1', 'X', 'User1', 'Program 1', '438'),
('2020-12-01', 'ID2', 'A', 'User2', 'Program 2', '118'),
('2020-12-01', 'ID3', 'X', 'User3', 'Program 3', '380'),
('2017-05-01', 'ID1', 'X', 'User1', 'Program 1', '438'),
('2020-11-01', 'ID2', 'A', 'User2', 'Program 2', '118'),
('2020-11-01', 'ID3', 'X', 'User3', 'Program 3', '380'),
('2017-04-01', 'ID1', 'X', 'User1', 'Program 1', '438'),
('2020-10-01', 'ID2', 'A', 'User2', 'Program 2', '118'),
('2020-10-01', 'ID3', 'X', 'User3', 'Program 3', '380')
)
, month_M AS (
    SELECT "Period","ID","Tag","Name","Program","Cost USD",
           DATEADD(MONTH, -1, "Period") AS "Period M-1",
           DATEADD(MONTH, -2, "Period") AS "Period M-2"
    FROM rev_dash
), month_M1 AS (
    SELECT "Period","ID","Tag","Name","Program","Cost USD"
    FROM rev_dash
), month_M2 AS (
    SELECT "Period","ID","Tag","Name","Program","Cost USD"
    FROM rev_dash
)
SELECT M."Period", M."ID", M."Tag", M."Name", M."Program", M."Cost USD",
       M."Period M-1", M1."Cost USD" AS "Total Cost M-1",
       M."Period M-2", M2."Cost USD" AS "Total Cost M-2"
FROM month_M AS M
LEFT JOIN month_M1 AS M1
  ON M."Period M-1" = M1."Period"
 AND M."ID" = M1."ID" AND M."Tag" = M1."Tag"
 AND M."Name" = M1."Name" AND M."Program" = M1."Program"
LEFT JOIN month_M2 AS M2
  ON M."Period M-2" = M2."Period"
 AND M."ID" = M2."ID" AND M."Tag" = M2."Tag"
 AND M."Name" = M2."Name" AND M."Program" = M2."Program"
WHERE "Total Cost M-2" IS NOT NULL;

Related

PostgreSQL: Get latest value before date

Let's say I have the following Inventory table.
id item_id stock_amount Date
(1, 1, 10, '2020-01-01T00:00:00')
(2, 1, 9, '2020-01-02T00:00:00')
(3, 1, 8, '2020-01-02T10:00:00')
(4, 3, 11, '2020-01-03T00:00:00')
(5, 3, 13, '2020-01-04T00:00:00')
(6, 4, 7, '2020-01-05T00:00:00')
(7, 2, 12, '2020-01-06T00:00:00')
Basically, for each day I want the sum of stock_amount across unique item_ids, excluding the current day's rows; for each item_id, the latest row before that day should be used. This is to calculate the starting stock on each day. So the response in this case would be:
Date starting_amount
'2020-01-01T00:00:00' 0
'2020-01-02T00:00:00' 10
'2020-01-03T00:00:00' 8
'2020-01-04T00:00:00' 19 -- # -> 11 + 8 (id 5 + id 3)
'2020-01-05T00:00:00' 21 -- # -> 13 + 8
'2020-01-06T00:00:00' 28 -- # -> 7 + 13 + 8
Any help would be greatly appreciated.
You can do it using nested subqueries like this:
select
    Date,
    coalesce(sum(stock_amount), 0) as starting_amount
from (
    select
        row_number() over (partition by i1.Date, item_id order by i2.Date desc) as i,
        i1.Date,
        i2.item_id,
        i2.stock_amount
    from (select distinct date_trunc('day', Date) as Date from Inventory) i1
    left outer join Inventory i2
        on i2.Date < i1.Date
) s
where i = 1
group by Date
order by Date
This query sorts each (day, item_id) group in descending date order and keeps only the first row, i.e. each item's latest amount before that day, then sums those per day.
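As an aside, a sketch of the same "latest row per item" step using Postgres's DISTINCT ON instead of row_number(), assuming the same Inventory table:
-- DISTINCT ON (item_id) keeps the first row per item after the descending
-- sort, i.e. the latest stock_amount before day d; days with no prior rows
-- fall back to 0 via the left join + coalesce.
select d.Date, coalesce(sum(i2.stock_amount), 0) as starting_amount
from (select distinct date_trunc('day', Date) as Date from Inventory) d
left join lateral (
    select distinct on (item_id) item_id, stock_amount
    from Inventory
    where Date < d.Date
    order by item_id, Date desc
) i2 on true
group by d.Date
order by d.Date;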

Elixir Accumulator List of Maps

Can you help me implement an accumulator for a list of maps?
[
%{
score: 1,
name: "Javascript",
},
%{
score: 2,
name: "Elixir",
},
%{
score: 10,
name: "Elixir",
}
]
The result should be:
[
%{
score: 12,
name: "Elixir",
},
%{
score: 1,
name: "Javascript",
}
]
I would appreciate your suggestions.
Regards
Assuming your original list is stored in the input local variable, one might start with Enum.reduce/3, using Map.update/4 in the reducer.
Enum.reduce(input, %{}, fn %{score: score, name: name}, acc ->
Map.update(acc, name, score, & &1 + score)
end)
#⇒ %{"Elixir" => 12, "Javascript" => 1}
If you insist on having a list of maps as a result (which is way less readable, IMSO), go further and Enum.map/2 the result:
Enum.map(%{"Elixir" => 12, "Javascript" => 1}, fn {name, score} ->
%{name: name, score: score}
end)
#⇒ [%{name: "Elixir", score: 12},
# %{name: "Javascript", score: 1}]
To sum it up:
input
|> Enum.reduce(%{}, fn %{score: score, name: name}, acc ->
Map.update(acc, name, score, & &1 + score)
end)
|> Enum.map(& %{name: elem(&1, 0), score: elem(&1, 1)})
#⇒ [%{name: "Elixir", score: 12},
# %{name: "Javascript", score: 1}]
Sidenote: maps in Erlang (and, hence, in Elixir) are not ordered. That means that if you want the resulting list sorted by name or by score, you should explicitly Enum.sort/2 it:
Enum.sort(..., & &1.score > &2.score)
#⇒ [%{name: "Elixir", score: 12},
# %{name: "Javascript", score: 1}]
A simple way could be to use Enum.group_by/3 to group the items by name, then Enum.sum/1 to sum the scores:
list
|> Enum.group_by(& &1.name, & &1.score)
|> Enum.map(fn {name, scores} -> %{name: name, score: Enum.sum(scores)} end)
Output:
[%{name: "Elixir", score: 12}, %{name: "Javascript", score: 1}]
If you were looking to create & use a more generalized solution, you could create your own Merger module.
defmodule Merger do
def merge_by(enumerable, name_fun, merge_fun) do
enumerable
|> Enum.group_by(name_fun)
|> Enum.map(fn {_name, items} -> Enum.reduce(items, merge_fun) end)
end
end
list = [
%{score: 1, name: "Javascript"},
%{score: 2, name: "Elixir"},
%{score: 10, name: "Elixir"}
]
Merger.merge_by(list, & &1.name, &%{&1 | score: &1.score + &2.score})
# => [%{name: "Elixir", score: 12}, %{name: "Javascript", score: 1}]

Attempting a transpose by performing multiple joins of table on subsets of same table in Hive

I'm attempting to transpose the date column by joining my table data_A multiple times against subsets of itself.
Here's the code to create my test dataset, which contains duplicate records for every value of count:
create table database.data_A (member_id string, x1 int, x2 int, count int, date date);
insert into table database.data_A
select 'A0001',1, 10, 1, '2017-01-01'
union all
select 'A0001',1, 10, 2, '2017-07-01'
union all
select 'A0001',2, 20, 1, '2017-01-01'
union all
select 'A0001',2, 20, 2, '2017-07-01'
union all
select 'B0001',3, 50, 1, '2017-03-01'
union all
select 'C0001',4, 100, 1, '2017-04-01'
union all
select 'D0001',5, 200, 1, '2017-10-01'
union all
select 'D0001',5, 200, 2, '2017-11-01'
union all
select 'D0001',5, 200, 3, '2017-12-01'
union all
select 'D0001',6, 500, 1, '2017-10-01'
union all
select 'D0001',6, 500, 2, '2017-11-01'
union all
select 'D0001',6, 500, 3, '2017-12-01'
union all
select 'D0001',7, 1000, 1, '2017-10-01'
union all
select 'D0001',7, 1000, 2, '2017-11-01'
union all
select 'D0001',7, 1000, 3, '2017-12-01';
I'd like to transpose the data into this:
member_id x1 x2 date1 date2 date3
'A0001', 1, 10, '2017-01-01' '2017-07-01' .
'A0001', 2, 20, '2017-01-01' '2017-07-01' .
'B0001', 3, 50, '2017-03-01' . .
'C0001', 4, 100, '2017-04-01' . .
'D0001', 5, 200, '2017-10-01' '2017-11-01' '2017-12-01'
'D0001', 6, 500, '2017-10-01' '2017-11-01' '2017-12-01'
'D0001', 7, 1000, '2017-10-01' '2017-11-01' '2017-12-01'
My first program (which was not successful):
create table database.data_B as
select a.member_id, a.x1, a.x2, a.date_1, b.date_2, c.date_3
from (select member_id, x1, x2, date as date_1 from database.data_A where count=1) as a
left join
(select member_id, date as date_2 from database.data_A where count=2) as b
on (a.member_id=b.member_id)
left join
(select member_id, date as date_3 from database.data_A where count=3) as c
on (a.member_id=c.member_id);
The query below will do the job:
select
    member_id,
    x1,
    x2,
    max(case when count = 1 then `date` else '.' end) as date1,
    max(case when count = 2 then `date` else '.' end) as date2,
    max(case when count = 3 then `date` else '.' end) as date3
from data_A
group by member_id, x1, x2

Google Spreadsheet: Use column + n in formula

In my spreadsheet, in column X, I have the following formula:
=ImportRange('_keys'!$B$2;"2015!A200:A203")
Now I'd like to copy this formula to column X+n (in this case X+2) so that it looks like:
=ImportRange('_keys'!$B$2;"2015!C200:C203")
But it doesn't change the column, and I have to change it by hand.
Is it possible to change this formula so that it always uses the column it is placed in?
You can use the COLUMN() function to get the column of the current cell as a number. Using ADDRESS() you can turn it into a cell reference string. See the docs for COLUMN and ADDRESS.
Your code becomes
=ImportRange('_keys'!$B$2;
CONCATENATE("2015!", ADDRESS(200, COLUMN()-Y, 4),
":", ADDRESS(203, COLUMN()-Y, 4))
)
where Y is the offset between column A and column X (where this formula is located). The third argument of ADDRESS makes both the row and column relative (without the $). Note that the order of arguments to ADDRESS is row then column, annoyingly.
My solution:
I wrote a simple custom function that converts numbers into letters.
/**
 * Converts a column number into a column letter
 *
 * @param {Number} aNumber Number of column
 * @return {String} Letter of column
 * @customfunction
 */
function COL_NR2LETTER(aNumber) {
  var letterArray = ['-', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'AA', 'AB', 'AC', 'AD', 'AE', 'AF', 'AG', 'AH', 'AI', 'AJ', 'AK', 'AL', 'AM', 'AN', 'AO', 'AP', 'AQ', 'AR', 'AS', 'AT', 'AU', 'AV', 'AW', 'AX', 'AY', 'AZ'];
  // Index 0 is a placeholder, so valid inputs are 1..letterArray.length-1.
  if (aNumber < 1 || aNumber >= letterArray.length)
    throw "column index out of bound error";
  return letterArray[aNumber];
}
Now it's possible to copy
=ImportRange('_keys'!$B$2;
"2015!" & COL_NR2LETTER(Column(A1)) &"200:"& COL_NR2LETTER(Column(A1)) &"203")
from Column X into a column X+n.

py2neo unique nodes with unique relations given timestamp

I am trying to create a graph that stores time-based interactions between nodes. I would like the nodes to be unique, and relationships between nodes to be unique for a given timestamp property.
My first attempt creates 2 nodes and 1 relationship, which is not what I want.
from py2neo import neo4j, node, rel
graph_db = neo4j.GraphDatabaseService()
graph_db.get_or_create_index(neo4j.Node, "node_index")
batch = neo4j.WriteBatch(graph_db)
# a TALKED_TO b at timestamp 0
batch.get_or_create_indexed_node('node_index', 'name', 'a', {'name': 'a'})
batch.get_or_create_indexed_node('node_index', 'name', 'b', {'name': 'b'})
batch.get_or_create_indexed_relationship('rel_index', 'type', 'TALKED_TO', 0, 'TALKED_TO', 1, {"timestamp": 0})
# a TALKED_TO b at timestamp 1
batch.get_or_create_indexed_node('node_index', 'name', 'a', {'name': 'a'})
batch.get_or_create_indexed_node('node_index', 'name', 'b', {'name': 'b'})
batch.get_or_create_indexed_relationship('rel_index', 'type', 'TALKED_TO', 3, 'TALKED_TO', 4, {"timestamp": 1})
# a TALKED_TO b at timestamp 2
batch.get_or_create_indexed_node('node_index', 'name', 'a', {'name': 'a'})
batch.get_or_create_indexed_node('node_index', 'name', 'b', {'name': 'b'})
batch.get_or_create_indexed_relationship('rel_index', 'type', 'TALKED_TO', 6, 'TALKED_TO', 7, {"timestamp": 0})
results = batch.submit()
print results
#[Node('http://localhost:7474/db/data/node/2'),
#Node('http://localhost:7474/db/data/node/3'),
#Relationship('http://localhost:7474/db/data/relationship/0'),
#Node('http://localhost:7474/db/data/node/2'),
#Node('http://localhost:7474/db/data/node/3'),
#Relationship('http://localhost:7474/db/data/relationship/0'),
#Node('http://localhost:7474/db/data/node/2'),
#Node('http://localhost:7474/db/data/node/3'),
#Relationship('http://localhost:7474/db/data/relationship/0')]
My second attempt creates 2 nodes and 0 relationships; I'm not sure why it fails to create any.
from py2neo import neo4j, node, rel
graph_db = neo4j.GraphDatabaseService()
graph_db.get_or_create_index(neo4j.Node, "node_index")
batch = neo4j.WriteBatch(graph_db)
# a TALKED_TO b at timestamp 0
batch.get_or_create_indexed_node('node_index', 'name', 'a', {'name': 'a'})
batch.get_or_create_indexed_node('node_index', 'name', 'b', {'name': 'b'})
batch.create(rel(0, 'TALKED_TO', 1, {"timestamp": 0}))
# a TALKED_TO b at timestamp 1
batch.get_or_create_indexed_node('node_index', 'name', 'a', {'name': 'a'})
batch.get_or_create_indexed_node('node_index', 'name', 'b', {'name': 'b'})
batch.create(rel(3, 'TALKED_TO', 4, {"timestamp": 1}))
# a TALKED_TO b at timestamp 2
batch.get_or_create_indexed_node('node_index', 'name', 'a', {'name': 'a'})
batch.get_or_create_indexed_node('node_index', 'name', 'b', {'name': 'b'})
batch.create(rel(6, 'TALKED_TO', 7, {"timestamp": 0}))
results = batch.submit()
print results
#[Node('http://localhost:7474/db/data/node/2'),
#Node('http://localhost:7474/db/data/node/3'),
#None]
So how do I achieve what I described above?
Okay, so I think I figured it out, but I'm not sure if it's efficient. Does anyone know a better way than the following?
# Create nodes a and b if they do not exist.
query = """MERGE (p:Person { name: {name} }) RETURN p"""
cypher_query = neo4j.CypherQuery(neo4j_graph, query )
result = cypher_query .execute(name='a')
result = cypher_query .execute(name='b')
# Create a relationship between a and b if it does not exist with the given timestamp value.
query = """
MATCH (a:Person {name: {a}}), (b:Person {name: {b}})
MERGE (a)-[r:TALKED_TO {timestamp: {timestamp}}]->(b)
RETURN r
"""
cypher_query = neo4j.CypherQuery(neo4j_graph, query)
result = cypher_query.execute(a='a', b='b', timestamp=0)
result = cypher_query.execute(a='a', b='b', timestamp=1)
