I have a data request on a dataset with more than 50 million records: I need to add some data fields based on date and time columns.
The dataset that I have:
ID CODE ISSUEDATE ISSUETIME ORDERDATE ORDERTIME QTY
101 A51 2020-08-24 11:24:00 2020-08-21 09:25:00 100
101 777 2020-08-21 08:30:00 2020-08-21 08:30:00 125
101 888 2020-08-21 09:30:00 2020-08-21 09:30:00 145
102 A51 2020-08-23 11:24:00 2020-08-21 09:25:00 100
102 256 2020-08-20 08:30:00 2020-08-20 08:30:00 125
102 256 2020-08-20 11:24:00 2020-08-20 11:24:00 145
I need to pull the data for CODE='A51'.
I want the following dataset:
ID CODE ISSUEDATE ISSUETIME ORDERDATE QTY ISSUEDATE2 ISSUETIME2 QTY2
101 A51 2020-08-24 11:24:00 2020-08-21 100 2020-08-21 08:30:00 125
102 A51 2020-08-23 11:24:00 2020-08-21 100 2020-08-20 08:30:00 125
I need to create the ISSUEDATE2, ISSUETIME2, and QTY2 variables based on the ORDERDATE and ORDERTIME of the code 'A51' row for each ID.
For each 'A51' row, the lookup takes its ORDERDATE and ORDERTIME and pulls the ISSUEDATE, ISSUETIME, and QTY from the closest non-'A51' row.
In the above example, ID 101 has ORDERDATE '2020-08-21' and ORDERTIME 09:25, so the most recent non-'A51' record on 2020-08-21 for this ID is the one at 08:30:00 with QTY 125.
If there is no non-'A51' row whose ISSUEDATE equals ORDERDATE, the row with the closest ISSUEDATE should be captured instead.
In the above example, ID 102 has ORDERDATE '2020-08-21', but there is no non-'A51' row with ISSUEDATE '2020-08-21', so the closest row ('2020-08-20', 08:30:00, QTY 125) is captured.
I am a novice with Teradata QUALIFY statements. All I am doing so far is creating separate datasets for the A51 and non-A51 rows and joining them on ID and time:
Create table _A51_ as Select ID, ISSUEDATE, ISSUETIME, ORDERDATE, ORDERTIME, QTY from
Have where CODE='A51'
Create table _Non_A51_ as Select ID, ISSUEDATE, ISSUETIME, ORDERDATE, ORDERTIME, QTY from
Have where CODE ne 'A51'
Create table _A51B_ as Select
A.*,B.ISSUEDATE as ISSUEDATE2, B.ISSUETIME as ISSUETIME2
from _A51_ as A
inner join _Non_A51_ as B
where A.ID=B.ID and
A.ORDERDATE=B.ISSUEDATE
and A.ORDERTIME le B.ISSUETIME
QUALIFY ROW_NUMBER() OVER (PARTITION BY A.ID, A.ISSUEDATE ORDER BY B.ISSUETIME)=1
This doesn't give me the rows for ID 102. I hope there is an easier way that doesn't require creating two datasets and joining them back.
Any help here is much appreciated.
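The lookup rule described above can be sketched in plain Python to pin down what the query has to do. This is only one reading of the rule that happens to reproduce both expected rows: pick the latest non-'A51' ISSUEDATE on or before ORDERDATE, then the latest ISSUETIME on that date that is not after ORDERTIME. The function names `closest_prior` and `build_result` are hypothetical, and the behaviour when nothing qualifies (returning empty values) is an assumption, not something the question specifies.

```python
# Sample rows from the question:
# (ID, CODE, ISSUEDATE, ISSUETIME, ORDERDATE, ORDERTIME, QTY)
rows = [
    (101, "A51", "2020-08-24", "11:24:00", "2020-08-21", "09:25:00", 100),
    (101, "777", "2020-08-21", "08:30:00", "2020-08-21", "08:30:00", 125),
    (101, "888", "2020-08-21", "09:30:00", "2020-08-21", "09:30:00", 145),
    (102, "A51", "2020-08-23", "11:24:00", "2020-08-21", "09:25:00", 100),
    (102, "256", "2020-08-20", "08:30:00", "2020-08-20", "08:30:00", 125),
    (102, "256", "2020-08-20", "11:24:00", "2020-08-20", "11:24:00", 145),
]


def closest_prior(a51_row, others):
    """Assumed rule: latest ISSUEDATE <= ORDERDATE, then latest
    ISSUETIME <= ORDERTIME on that date. ISO-formatted strings
    compare correctly as plain text."""
    row_id, order_date, order_time = a51_row[0], a51_row[4], a51_row[5]
    candidates = [r for r in others if r[0] == row_id and r[2] <= order_date]
    if not candidates:
        return None
    best_date = max(r[2] for r in candidates)
    on_date = [r for r in candidates if r[2] == best_date and r[3] <= order_time]
    if not on_date:
        return None  # assumption: no fallback beyond the chosen date
    return max(on_date, key=lambda r: r[3])


def build_result(rows):
    """Return one output row per 'A51' row with the three looked-up fields appended."""
    a51 = [r for r in rows if r[1] == "A51"]
    others = [r for r in rows if r[1] != "A51"]
    result = []
    for r in a51:
        match = closest_prior(r, others)
        extra = (match[2], match[3], match[6]) if match else (None, None, None)
        # ID, CODE, ISSUEDATE, ISSUETIME, ORDERDATE, QTY, then the looked-up fields
        result.append((r[0], r[1], r[2], r[3], r[4], r[6]) + extra)
    return result


result = build_result(rows)
for line in result:
    print(line)
```

On the six sample rows this yields one row per ID, with ('2020-08-21', '08:30:00', 125) appended for ID 101 and ('2020-08-20', '08:30:00', 125) for ID 102.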
I am trying to merge two SAS tables based on a third “bridge table” and perform some calculations during the process. The logic should be: “For each good, look up the price and calculate the annual revenue.”
My raw data: one table with the annual quantity of goods, one table with prices, and one bridge table that says which price is used for which good.
data work.goods;
input date date. GoodA GoodB GoodC;
format date year. ;
datalines;
01Jan20 10 12 2
01Jan21 12 11 5
run;
data work.price;
input date date. PriceA PriceB;
format date year.;
datalines;
01Jan20 220 110
01Jan21 250 120
run;
data work.bridgetable;
input goods $5. price $7.;
datalines;
GoodA PriceA
GoodB PriceB
GoodC PriceB
run;
So far, I used a proc sql statement without the information in the bridge table.
proc sql;
create table work.result as
select goods.date,
goods.GoodA * price.PriceA as RevenueA,
goods.GoodB * price.PriceB as RevenueB,
goods.GoodC * price.PriceB as RevenueC
from work.goods as goods, work.price as price
where goods.date = price.date;
quit;
Now, I would like to use the information from the bridge table, so that I can change the assignment of a price to a good (e.g. PriceA is used for GoodC instead of PriceB). In addition, I’d like the code to be more dynamic, without the hardcoding, so that I can add new goods and prices to my tables without re-coding the ‘select’ part of my SQL statement.
How do I implement the bridge table in proc sql?
Thanks a lot for your help!
Your first two tables need to be vertical and not horizontal. Then the structure will not change when new goods or new price categories are added.
You can use PROC TRANSPOSE to convert your current tables.
data goods;
input year GoodA GoodB GoodC;
datalines;
2020 10 12 2
2021 12 11 5
;
data price;
input year PriceA PriceB;
datalines;
2020 220 110
2021 250 120
;
data bridgetable;
input goods $5. price $7.;
datalines;
GoodA PriceA
GoodB PriceB
GoodC PriceB
;
proc transpose data=goods
name=goods
out=goods_tall(rename=(col1=amount))
;
by year;
var good: ;
run;
proc transpose data=price
name=price
out=price_tall(rename=(col1=unit_price))
;
by year;
var price: ;
run;
Now the tables are easy to join.
proc sql ;
create table want as
select *,unit_price*amount as revenue
from goods_tall
natural join price_tall
natural join bridgetable
;
quit;
Results
unit_
Obs goods price year amount price revenue
1 GoodA PriceA 2020 10 220 2200
2 GoodB PriceB 2020 12 110 1320
3 GoodC PriceB 2020 2 110 220
4 GoodA PriceA 2021 12 250 3000
5 GoodB PriceB 2021 11 120 1320
6 GoodC PriceB 2021 5 120 600
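To see why the tall layout makes the bridge table easy to use, here is a minimal plain-Python sketch of the same lookup. It is not SAS; it only mirrors the join logic, and the names `revenue_rows`, `goods_tall`, `price_tall` and `bridge` are illustrative stand-ins for the tables above.

```python
# Tall tables, mirroring goods_tall and price_tall from the answer.
goods_tall = [
    (2020, "GoodA", 10), (2020, "GoodB", 12), (2020, "GoodC", 2),
    (2021, "GoodA", 12), (2021, "GoodB", 11), (2021, "GoodC", 5),
]
price_tall = {
    (2020, "PriceA"): 220, (2020, "PriceB"): 110,
    (2021, "PriceA"): 250, (2021, "PriceB"): 120,
}
# Bridge table: which price applies to which good.
bridge = {"GoodA": "PriceA", "GoodB": "PriceB", "GoodC": "PriceB"}


def revenue_rows(goods_tall, price_tall, bridge):
    """Join goods to prices through the bridge and compute revenue."""
    out = []
    for year, good, amount in goods_tall:
        price_name = bridge[good]                   # goods -> price category
        unit_price = price_tall[(year, price_name)] # (year, category) -> price
        out.append((good, price_name, year, amount, unit_price, amount * unit_price))
    return out


want = revenue_rows(goods_tall, price_tall, bridge)

# Reassigning GoodC to PriceA only touches the bridge, not the join logic.
alt_bridge = dict(bridge, GoodC="PriceA")
want_alt = revenue_rows(goods_tall, price_tall, alt_bridge)
```

With the original bridge, the revenue column matches the results table above (2200, 1320, 220, 3000, 1320, 600). With GoodC switched to PriceA, only the GoodC rows change.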
I am trying to fetch the tag values for a given measurement.
I have my measurement in InfluxDB as given below.
> select * from "EVENT_LIVE"
name: EVENT_LIVE
time GROUP_ID COUNT
---- ------------- ---------
1531008000000000000 84 2
1531008000000000000 9 8
1532822400000000000 249 1
1534636800000000000 43 1
1534636800000000000 68 1
1535241600000000000 13 13
1535241600000000000 18 4
1535241600000000000 205 2
1535241600000000000 21 6
1535241600000000000 214 1
1535241600000000000 23 1
1535241600000000000 238 1
1535241600000000000 249 1
1535241600000000000 282 14
1535241600000000000 29 1
1535241600000000000 316 3
1535241600000000000 32 13
1535241600000000000 41 7
1535241600000000000 43 1
1535241600000000000 6 1
Here the name of the measurement is EVENT_LIVE, GROUP_ID is the tag, and COUNT is the value recorded for the measurement.
I executed the InfluxDB query below.
> show tag values with key=GROUP_ID
name: EVENT_LIVE
key value
--- -----
GROUP_ID 13
GROUP_ID 18
GROUP_ID 204
GROUP_ID 206
GROUP_ID 21
GROUP_ID 217
GROUP_ID 22
GROUP_ID 238
GROUP_ID 245
GROUP_ID 249
GROUP_ID 25
GROUP_ID 259
name: EVENT_COMPLETED
key value
--- -----
GROUP_ID 15
GROUP_ID 18
GROUP_ID 204
GROUP_ID 206
GROUP_ID 21
GROUP_ID 234
GROUP_ID 22
GROUP_ID 238
GROUP_ID 245
GROUP_ID 265
GROUP_ID 13
GROUP_ID 259
The tag values are retrieved for all the measurements in the database.
But when I tried to fetch the tag values specific to the measurement EVENT_LIVE by executing the query below, I did not see any results.
What can be the issue with the query below?
show tag values with key=GROUP_ID where "name" ='PGM_LIVE'
Q: Why is my query showing data from all measurements?
A: You have to be explicit. That is, you need to tell your query where to look; otherwise it returns matching data from all measurements.
To narrow the search down to a single measurement, use the FROM clause, just as you did in your SELECT statement.
e.g. show tag values from EVENT_LIVE with key=GROUP_ID;
See https://docs.influxdata.com/influxdb/v1.6/query_language/data_exploration/#from-clause
Hi Siva, the following should work (measurement names are case-sensitive, so they are written in uppercase as in your database):
show tag values from PGM_LIVE with key=GROUP_ID
or
show tag values from EVENT_LIVE with key=GROUP_ID
Best regards, Marc V. (Mata)
I have the following QUERY formula that pivots the follower count of different Facebook pages from one sheet to another:
=QUERY('Page Level'!A2:N, "SELECT C, MAX(J) where J<>0 and M<>'' GROUP BY C PIVOT M label C 'Date'")
The result is something like this, but with more countries:
Date Australia Austria Belgium
2018-01-01 7912 4365 1343
2018-01-02 7931 4364 1343
2018-01-03 7930 4366 1344
2018-01-04 7928 4365 1345
2018-01-05 7929 4362 1347
2018-01-06 7939 4363 1347
2018-01-07 7950 4361 1348
2018-01-08 7933 4339 1343
Instead of the full follower count, I would like a simple difference between consecutive dates. So, taking the table above, the result would be something like this:
Date Australia Austria Belgium
2018-01-01 7912 4365 1343
2018-01-02 19 -1 0
2018-01-03 -1 2 1
and so on for each new date. Does anybody know how to do this in Google Sheets?
If it helps, I also have the data with all the countries in the same column. However, that data is ordered by date and then by country, rather than by country and then by date, so I imagine a solution would have to sort the data somehow beforehand.
Unpivoted data:
Date Country Followers
2018-01-01 Australia 7912
2018-01-01 Austria 4365
2018-01-01 Belgium 1343
2018-01-02 Australia 7931
2018-01-02 Austria 4364
2018-01-02 Belgium 1343
2018-01-03 Australia 7930
2018-01-03 Austria 4366
2018-01-03 Belgium 1344
2018-01-04 Australia 7928
2018-01-04 Austria 4365
2018-01-04 Belgium 1345
2018-01-05 Australia 7929
2018-01-05 Austria 4362
2018-01-05 Belgium 1347
2018-01-06 Australia 7939
2018-01-06 Austria 4363
2018-01-06 Belgium 1347
2018-01-07 Australia 7950
2018-01-07 Austria 4361
2018-01-07 Belgium 1348
2018-01-08 Australia 7933
2018-01-08 Austria 4339
2018-01-08 Belgium 1343
I have got fairly close to it using the basic idea of subtracting the pivoted data from the pivoted data offset by 1:
=iferror(arrayformula({query(A:C,"SELECT A, MAX(C) where C<>0 and B<>'' GROUP BY A PIVOT B limit 1 label A 'Date'");
query(A:C,"SELECT A, MAX(C) where C<>0 and B<>'' GROUP BY A PIVOT B offset 1")-
query(A:C,"SELECT 0, MAX(C) where C<>0 and B<>'' GROUP BY A PIVOT B")}),"")
As you would expect, it produces an error where it tries to subtract the header row and, later, a non-existent row. I don't know how to avoid this except by using IFERROR, which leaves a blank line in the output.
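For reference, the target calculation itself (keep the first date as-is, then subtract the previous date's value per country) can be sketched outside Sheets in plain Python. This is not a Sheets formula, just a way to check expected numbers. The function name `daily_differences` is hypothetical, and only the first three dates of the sample data are used.

```python
from collections import defaultdict

# (Date, Country, Followers): first three dates of the unpivoted sample data.
rows = [
    ("2018-01-01", "Australia", 7912), ("2018-01-01", "Austria", 4365),
    ("2018-01-01", "Belgium", 1343),
    ("2018-01-02", "Australia", 7931), ("2018-01-02", "Austria", 4364),
    ("2018-01-02", "Belgium", 1343),
    ("2018-01-03", "Australia", 7930), ("2018-01-03", "Austria", 4366),
    ("2018-01-03", "Belgium", 1344),
]


def daily_differences(rows):
    """Pivot by country, keep the first date's counts, then emit day-over-day deltas."""
    by_country = defaultdict(dict)
    for day, country, followers in rows:
        by_country[country][day] = followers
    dates = sorted({day for day, _, _ in rows})  # ISO dates sort as text
    countries = sorted(by_country)
    table = [["Date"] + countries]
    for i, day in enumerate(dates):
        line = [day]
        for country in countries:
            current = by_country[country].get(day)
            if i == 0:
                line.append(current)  # first row keeps the full count
            else:
                previous = by_country[country].get(dates[i - 1])
                missing = current is None or previous is None
                line.append(None if missing else current - previous)
        table.append(line)
    return table


table = daily_differences(rows)
for line in table:
    print(line)
```

The input order does not matter here because the dates are sorted inside the function, which sidesteps the sorting concern raised in the question.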
I have a table like this
Row time viewCount
1 00:00:00 31
2 00:00:01 44
3 00:00:02 78
4 00:00:03 71
5 00:00:04 72
6 00:00:05 73
7 00:00:06 64
8 00:00:07 70
I would like to aggregate this into
Row time viewCount
1 00:00:00 31
2 00:15:00 445
3 00:30:00 700
4 00:45:00 500
5 01:00:00 121
6 01:15:00 475
.
.
.
Please help. Thanks in advance
Supposing that you actually have a TIMESTAMP column, you can use an approach like this:
#standardSQL
SELECT
TIMESTAMP_SECONDS(
UNIX_SECONDS(timestamp) -
MOD(UNIX_SECONDS(timestamp), 15 * 60)
) AS time,
SUM(viewCount) AS viewCount
FROM `project.dataset.table`
GROUP BY time;
It relies on conversion to and from Unix seconds to compute the 15-minute intervals. Note that, unlike Mikhail's solution, it will not produce a row with a zero count for an empty 15-minute interval (it's not clear whether this is important to you).
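The rounding trick in the query (subtract the remainder of the Unix seconds modulo 15 * 60) can be checked with a small Python sketch. The timestamps below are made-up sample values, and `bucket_start` and `aggregate` are hypothetical helper names, not part of BigQuery.

```python
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 15 * 60  # same constant as in the SQL above


def bucket_start(ts):
    """Floor a timezone-aware datetime to the start of its 15-minute bucket."""
    seconds = int(ts.timestamp())
    floored = seconds - seconds % BUCKET_SECONDS
    return datetime.fromtimestamp(floored, tz=timezone.utc)


def aggregate(rows):
    """Sum view counts per bucket; empty buckets produce no row."""
    totals = defaultdict(int)
    for ts, view_count in rows:
        totals[bucket_start(ts)] += view_count
    return dict(sorted(totals.items()))


# Hypothetical sample data: (timestamp, viewCount)
rows = [
    (datetime(2018, 1, 1, 0, 0, 5, tzinfo=timezone.utc), 31),
    (datetime(2018, 1, 1, 0, 14, 59, tzinfo=timezone.utc), 44),
    (datetime(2018, 1, 1, 0, 15, 0, tzinfo=timezone.utc), 78),
    (datetime(2018, 1, 1, 0, 29, 30, tzinfo=timezone.utc), 71),
]

totals = aggregate(rows)
for start, total in totals.items():
    print(start.isoformat(), total)
```

The first two rows fall into the 00:00 bucket and the last two into the 00:15 bucket. As with the SQL, a bucket that contains no rows simply does not appear.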
Below is for BigQuery Standard SQL.
Note: you provided a simplified example of your data, and the query below follows it, so instead of aggregating every 15 minutes it aggregates every 2 seconds. This makes it easy for you to test and play with. It can easily be adjusted to 15 minutes by changing SECOND to MINUTE in 3 places and 2 to 15 in 3 places. Also, this example uses the TIME data type for the time field, as in your example, so it is limited to a single 24-hour period. Most likely your real data has DATETIME or TIMESTAMP; in that case you will also need to replace all TIME_* functions with the respective DATETIME_* or TIMESTAMP_* functions.
So, finally, the query is:
#standardSQL
WITH `project.dataset.table` AS (
SELECT TIME '00:00:00' time, 31 viewCount UNION ALL
SELECT TIME '00:00:01', 44 UNION ALL
SELECT TIME '00:00:02', 78 UNION ALL
SELECT TIME '00:00:03', 71 UNION ALL
SELECT TIME '00:00:04', 72 UNION ALL
SELECT TIME '00:00:05', 73 UNION ALL
SELECT TIME '00:00:06', 64 UNION ALL
SELECT TIME '00:00:07', 70
),
period AS (
SELECT MIN(time) min_time, MAX(time) max_time, TIME_DIFF(MAX(time), MIN(time), SECOND) diff
FROM `project.dataset.table`
),
checkpoints AS (
SELECT TIME_ADD(min_time, INTERVAL step SECOND) start_time, TIME_ADD(min_time, INTERVAL step + 2 SECOND) end_time
FROM period, UNNEST(GENERATE_ARRAY(0, diff + 2, 2)) step
)
SELECT start_time time, SUM(viewCount) viewCount
FROM checkpoints c
JOIN `project.dataset.table` t
ON t.time >= c.start_time AND t.time < c.end_time
GROUP BY start_time
ORDER BY start_time, time
and the result is:
Row time viewCount
1 00:00:00 75
2 00:00:02 149
3 00:00:04 145
4 00:00:06 134
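The checkpoint idea (generate window start times, then sum the rows that fall into each half-open window) can be mirrored in a short Python sketch using the same eight sample rows. Times are held as seconds since midnight, and `windowed_sums` and `fmt` are hypothetical helper names.

```python
STEP = 2  # window length in seconds, as in the query above

# (seconds since midnight, viewCount), matching the sample rows 00:00:00..00:00:07.
rows = [(0, 31), (1, 44), (2, 78), (3, 71), (4, 72), (5, 73), (6, 64), (7, 70)]


def fmt(seconds):
    """Format seconds since midnight as HH:MM:SS."""
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"


def windowed_sums(rows, step):
    """Sum viewCount over half-open windows [start, start + step)."""
    first = min(t for t, _ in rows)
    last = max(t for t, _ in rows)
    result = []
    for start in range(first, last + 1, step):
        in_window = [v for t, v in rows if start <= t < start + step]
        if in_window:  # like the inner JOIN, skip windows without rows
            result.append((fmt(start), sum(in_window)))
    return result


sums = windowed_sums(rows, STEP)
for start, total in sums:
    print(start, total)
```

On the sample data this gives the same four rows as the query result above.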
Is it possible to get individual data from cumulative data?
The output of the following query is shown below:
SELECT mean("value") FROM "statsd_value" WHERE "type_instance" = 'counts' AND time > now() - 5m GROUP BY time(10s) fill(none)
TimeStamp Value
1463393810 0
1463393820 10
1463393830 23
1463393840 34
1463393850 67
1463393860 90
1463393870 104
Basically, the above data is cumulative. I want to get the individual per-interval values from it, similar to this:
TimeStamp Value
1463393820 10
1463393830 13
1463393840 11
1463393850 33
1463393860 23
1463393870 14
Is it possible to form a query that returns the data in this way?
InfluxQL provides a difference function that will give you the functionality that you're looking for.
The query would look like this:
SELECT difference(mean("value")) FROM "statsd_value" WHERE "type_instance" = 'counts' AND time > now() - 5m GROUP BY time(10s) fill(none)
TimeStamp Value
1463393820 10
1463393830 13
1463393840 11
1463393850 33
1463393860 23
1463393870 14
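What the difference function computes here is the subtraction of each value from the one after it. A minimal Python sketch of that arithmetic on the cumulative series from the question (the helper name `difference` is only illustrative, not an InfluxDB client call):

```python
# Cumulative series from the question: (timestamp, value)
cumulative = [
    (1463393810, 0),
    (1463393820, 10),
    (1463393830, 23),
    (1463393840, 34),
    (1463393850, 67),
    (1463393860, 90),
    (1463393870, 104),
]


def difference(points):
    """Return (timestamp, value - previous value) for every point after the first."""
    return [(t2, v2 - v1) for (_, v1), (t2, v2) in zip(points, points[1:])]


individual = difference(cumulative)
for timestamp, value in individual:
    print(timestamp, value)
```

The first point has no predecessor, so the output starts at the second timestamp, as in the table above.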