How to merge three tables by different conditions?

I am trying to do some calculations based on joined data sets.
My aim is to calculate the revenue at the previous year's prices.
The code below works for the revenue at current prices and sales:
data work.price;
  input date :date9. car $ price;
  format date date9.;
datalines;
01Jan19 Model1 7000
01Jan19 Model2 4000
01Jan19 Model3 5000
01Jan20 Model1 7500
01Jan20 Model2 4800
01Jan20 Model3 4500
01Jan21 Model1 8000
01Jan21 Model2 5200
01Jan21 Model3 4000
run;
data work.sales;
  input date :date9. type $ sales;
  format date date9.;
datalines;
01Jan19 A 10
01Jan19 B 4
01Jan19 C 50
01Jan20 A 18
01Jan20 B 10
01Jan20 C 16
01Jan21 A 22
01Jan21 B 8
01Jan21 C 13
run;
data work.assignment;
  input car $6. type $7.;
datalines;
Model1 A
Model2 B
Model3 C
run;
proc sql;
  create table want as
  select date format=date9., *, price*sales as return
  from sales
  natural join price
  natural join assignment;
quit;
My solution so far was to shift the time series of the prices prior to joining.
But I wonder whether this step can be done more efficiently within the proc sql statement.
data work.price;
  set work.price;
  date = intnx('month', date, +12);
run;
Thanks a lot for your help!

You can do that by using an inner join and specifying the join conditions. For your case this would be something like:
proc sql;
  create table want as
  select
    s.date format=date9., s.type, s.sales, p.car, p.price,
    p.price * s.sales as return
  from sales as s
  inner join price as p
    /* match each sales year with the price of the year before */
    on s.date = intnx("year", p.date, 1)
  inner join assignment as a
    on s.type = a.type
   and p.car = a.car;
quit;
Personally, I always write explicit inner joins and do not rely on natural joins. If something is wrong with your input data, you get an error instead of a wrong result, and the needed columns are selected explicitly instead of just selecting *.
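To illustrate why (a throwaway sketch using the tables from the question): a natural join matches on every column the two tables happen to share, so its behavior changes silently when columns are added or renamed.
proc sql;
  /* price and sales share only DATE, so this pairs every car with
     every type sold on the same date: nine rows per year instead of
     three. The assignment table in the query above is what narrows
     it down, and nothing warns you if it is left out. */
  create table every_pairing as
  select *
  from sales natural join price;
quit;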

Related

JOIN ON second highest value (Impala)

I don't know how or even if this is possible. I am trying to JOIN tables on the second highest value. I tried rowNumber, lag, lead & rank but haven't been able to get any of them to do what I need. To summarize, I'm just trying to shift the activitydate table down one row to join on rollDate minus 1 (but I can't simply subtract 1 because the dates are not consecutive; there are days missing).
Does anyone know a good way to do this? Any suggestions are appreciated!
Select
ds.activitydate
,sum(ws.weeklyTotals / ds.daysBetween) as newRunRates -- getting an average of daily activity from weekly totals
from
(select
fsc.activitydate
,fsc.weekstart
,max(fsc.activitydate) OVER (partition by fsc.weekstart) as rollUpDate
,datediff(to_date(max(fsc.activitydate) OVER (partition by fsc.weekstart)), to_date(fsc.weekstart)) + 1 as daysBetween
from fiscalcalendar fsc
) ds -- used this to get a week-ending date bc that is what I need to join on. I only have a week start in this table
left join
(select
activitydate_iso
,count(distinct assignedmaincomponentid) as weeklyTotals
from activityTable
group by 1
) ws -- weeklySplits -- this gives me my weekly totals by a week ending date
on ds.rollUpDate = ws.activitydate_iso
-- need this join logic to actually be
-- on ds.rollUpDate = (max(ws.activitydate_iso) where activitydate_iso < rollUpDate)
where ds.activitydate between '2020-05-22' and '2020-06-15'
group by 1
order by 1
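One way to express that join logic, as a sketch only (with ds and ws kept as the derived tables from the query above, e.g. via a WITH clause): rank the week-ending dates per roll-up date with row_number() and keep the latest one that falls strictly before it.
select rollUpDate, activitydate_iso, weeklyTotals
from (
    select
        ds.rollUpDate,
        ws.activitydate_iso,
        ws.weeklyTotals,
        -- the newest week-ending date before the roll-up date gets rn = 1
        row_number() over (
            partition by ds.rollUpDate
            order by ws.activitydate_iso desc
        ) as rn
    from ds
    join ws
        on ws.activitydate_iso < ds.rollUpDate
) ranked
where rn = 1
If your Impala version rejects a pure non-equality ON clause, the same thing can be written as a cross join with the inequality moved into the WHERE clause.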

proc sql inner join behavior and required select statements

I recently started using SAS, only receiving a basic training that didn't cover proc sql. I'd like to read up a bit more on SAS sql when I have the time.
For now, I found a solution to what I wanted to do, but I'm having difficulties understanding what is happening.
My issue started when I wanted to find out which subjects in my dataset have a certain value for all their records. I made use of my previously written snippet of code that I thought I understood. I just tried adding a couple more variables and group by statements:
data have;
input subject:$1. myvar:1. mycount:1.;
datalines;
a 1 1
a 0 2
a 0 3
b 1 1
b 0 2
b 1 3
c 1 1
c 1 2 /*This subject has myvar = 1 for all its observations*/
;
run;
*find subjects;
proc sql;
create table want as
/* select*/
/* distinct x.subject */
/* from */
(select distinct subject, count(myvar) as myvar_c
from have where myvar = 1 group by subject) x,
(select distinct subject, max(mycount) as max_c
from have group by subject) y
where x.subject = y.subject and x.myvar_c = y.max_c;
quit;
When the comment markers are removed from the 'select distinct x.subject from' lines in the create table statement, the above code works as it should.
However, I've previously also created another piece of code, to select all subjects in my dataset that have two types of records:
data have2;
input subject:$1. mytype:1.;
datalines;
a 1
a 0
a 0
b 1
b 0
b 1
c 1
c 1 /*This subject doesn't have two types of records in all its observations*/
;
run;
*Find subjects;
proc sql;
create table want2 as select
distinct x.subject from
have2 x,
(select distinct subject, count(distinct mytype) as mytype_c from have2 group by subject) y
where y.mytype_c = 2 and x.subject = y.subject;
quit;
Which is similar, but didn't require the additional select statement: the first code has three select statements, while the second needs only two.
Can someone explain why the extra select is required in the first case?
Or link me to some good documentation that lists the specifics of these types of joins? Can anyone also tell me the specific name of this type of join where you only use a comma?
While I'm writing this, I also see that I could have taken the code I initially wrote to find subjects that have only one type of record and tweaked it for my current issue, but I would still like to know what is happening in the first example.
The SQL join construct
FROM ONE, TWO, THREE, …
is known as a CROSS JOIN: a join without criteria. The comma (,) syntax is less prevalent today, and the following construct is recommended:
FROM ONE
CROSS JOIN TWO
CROSS JOIN THREE
The result set is a Cartesian product, and the number of rows is the product of the row counts of the cross-joined tables (three 10-row tables would yield 10 × 10 × 10 = 1,000 rows).
When the query has criteria (WHERE clause) the join is an INNER JOIN.
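So your first query is an inner join written in comma form. A sketch of the same query with explicit join syntax (note the DISTINCT keywords are redundant once you GROUP BY subject):
proc sql;
  create table want as
  select x.subject
  from
    (select subject, count(myvar) as myvar_c
     from have
     where myvar = 1
     group by subject) as x
  inner join
    (select subject, max(mycount) as max_c
     from have
     group by subject) as y
  on x.subject = y.subject
     and x.myvar_c = y.max_c;
quit;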
The SAS documentation for Proc SQL is a good starting point and includes examples.
joined-table Component
Joins a table with itself or with other tables or views.
…
Its Details section covers the types of joins (inner, outer, cross, union, and natural joins), joining a table with itself, joining more than two tables, table aliases, the table limit, specifying the rows to be returned, and a comparison of joins and subqueries.
General tip: if you want to fiddle with SQL queries in a browser, try the SQL Fiddle web site.

How to combine query results?

I have three queries that are tied together. The final output requires multiple loops over the queries. This way works just fine but seems very inefficient and too complex in my opinion. Here is what I have:
Query 1:
<cfquery name="qryTypes" datasource="#application.datasource#">
SELECT
t.type_id,
t.category_id,
c.category_name,
s.type_shortcode
FROM type t
INNER JOIN section s
ON s.type_id = t.type_id
INNER JOIN category c
ON c.category_id = t.category_id
WHERE t.rec_id = 45 -- This parameter is passed from form field.
ORDER BY s.type_name,c.category_name
</cfquery>
Query qryTypes will produce this result set (type_id, category_id, category_name, type_shortcode):
4 11 SP PRES
4 12 CH PRES
4 13 MS PRES
4 14 XN PRES
Then I loop over qryTypes and, for each of its records, fetch the matching rows with another query:
Query 2:
<cfloop query="qryTypes">
<cfquery name="qryLocation" datasource=#application.datasource#>
SELECT l.location_id, l.spent_amount
FROM locations l
WHERE l.location_type = '#trim(category_name)#'
AND l.nofa_id = 45 -- This is form field
AND l.location_id = '#trim(category_id)##trim(type_id)#'
GROUP BY l.location_id,l.spent_amount
ORDER BY l.location_id ASC
</cfquery>
<cfset spent_total = arraySum(qryLocation['spent_amount']) />
<cfset amount_total = 0 />
<cfloop query="qryLocation">
<cfquery name="qryFunds" datasource=#application.datasource#>
SELECT sum(budget) AS budget
FROM funds f
WHERE f.location_id= '#qryLocation.location_id#'
AND nofa_id = 45
</cfquery>
<cfscript>
if (qryFunds.budget gt 0) {
amount_total = amount_total + qryFunds.budget;
}
</cfscript>
</cfloop>
<cfset GrandTotal = GrandTotal + spent_total />
<cfset GrandTotalad = GrandTotalad + amount_total />
</cfloop>
After the loops are completed this is result:
CATEGORY NAME SPENT TOTAL AMOUNT TOTAL
SP 970927 89613
CH 4804 8759
MS 9922 21436
XN 39398 4602
Grand Total: 1025051 124410
Is there a good way to merge this together and run only one query instead of three queries with nested loops? I was wondering if this might be a good fit for a stored procedure that does all the data manipulation. If anyone has suggestions, please let me know.
qryTypes returns X records
qryLocation returns Y records
So far you've run (1 + X) queries.
qryFunds returns Z records and runs once for every row of every qryLocation.
Now you've run 1 + X + (X × Y) queries.
The more data each returns, the more queries you'll run. Obviously not good.
If all you want is the final totals for each category, in a stored procedure, you could create a temp table with the joined data from qryTypes and qryLocation. Then your last qryFunds is just joined against that temp table data.
SELECT
sum(budget) AS budget
FROM
funds f
INNER JOIN
#TEMP_TABLE t ON t.location_id = f.location_id
AND
nofa_id = 45
You could then get other sums off the temp table if needed. It's possible this could all be worked into a single query, but maybe this helps you get there.
Also, a stored procedure can return multiple record sets, so you can have one return the aggregated table amount data and a 2nd return the grand total. This would keep all the calculations on the database and no need for CF to be involved.
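A rough sketch of what that could look like, assuming SQL Server (the #temp syntax above suggests it); the procedure and parameter names are made up, and the column logic is lifted from the queries in the question:
CREATE PROCEDURE dbo.GetCategoryTotals
    @rec_id  INT,   -- the form-field value (45 in the question)
    @nofa_id INT
AS
BEGIN
    -- one set-based pass replaces the qryTypes + qryLocation loop
    SELECT  c.category_name,
            l.location_id,
            l.spent_amount
    INTO    #type_locations
    FROM    type t
    INNER JOIN category  c ON c.category_id   = t.category_id
    INNER JOIN locations l ON l.location_type = c.category_name
                          AND l.location_id   = CONCAT(t.category_id, t.type_id)
                          AND l.nofa_id       = @nofa_id
    WHERE   t.rec_id = @rec_id;

    -- record set 1: per-category totals; funds is pre-aggregated per
    -- location so the join cannot double-count spent_amount
    SELECT  tl.category_name,
            SUM(tl.spent_amount)        AS spent_total,
            SUM(COALESCE(fb.budget, 0)) AS amount_total
    FROM    #type_locations tl
    LEFT JOIN (SELECT location_id, SUM(budget) AS budget
               FROM funds
               WHERE nofa_id = @nofa_id
               GROUP BY location_id) fb
           ON fb.location_id = tl.location_id
    GROUP BY tl.category_name;

    -- record set 2: the grand totals
    SELECT  SUM(tl.spent_amount)        AS grand_spent,
            SUM(COALESCE(fb.budget, 0)) AS grand_amount
    FROM    #type_locations tl
    LEFT JOIN (SELECT location_id, SUM(budget) AS budget
               FROM funds
               WHERE nofa_id = @nofa_id
               GROUP BY location_id) fb
           ON fb.location_id = tl.location_id;
END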

SAS base programming

Is there a way to join/merge two datasets/tables in which one record in dataset B refers at the same time to a row (condition 1) and to a column (condition 2) of dataset A?
Condition 1: b.City = a.getColumnName() AND
Condition 2: b.Part_code = a.Part_code
What I am looking for is something equivalent to this hypothetical getColumnName(), to be able to make the comparison by row and by column at the same time.
Datasets are as follows (simplified examples):
Dataset A:
Part_code Miami LA
A_1 60000 38000
A_2 5000 2000
A_3 1000 60000
Dataset B:
Part_code City
A_1 Miami
Desired output (joined):
Part_code City Part_stock
A_1 Miami 60000
Thank you very much in advance!
What you are really looking to do is pivot the A data set and then filter it based on the cities in the B data set.
Proc Transpose to pivot the table:
proc sort data=a;
by part_code;
run;
proc transpose data=A out=A(rename=(_name_=city col1=part_stock));
by part_code;
run;
Then use an inner join to filter based on B
Proc sql noprint;
create table want as
select a.*
from A as a
inner join
B as b
on a.part_code = b.part_code
and a.city = b.city;
quit;
DomPazz's answer is the better solution because the parts table should be restructured to better handle lookups like this.
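For reference, a sketch of that restructured layout: it is exactly the shape the PROC TRANSPOSE step above produces, shown here as a hypothetical permanent table.
data a_long;
  input Part_code $ City $ Part_stock;
datalines;
A_1 Miami 60000
A_1 LA 38000
A_2 Miami 5000
A_2 LA 2000
A_3 Miami 1000
A_3 LA 60000
;
run;
Stored this way, the lookup reduces to the plain inner join on Part_code and City already shown above, with no transpose needed at query time.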
With that stated, here's a solution that uses the existing table structure. Note that both tables A and B must be sorted by Part_Code first.
data want;
  merge
    B (in=b)
    A (in=a)
  ;
  by Part_Code;
  if a & b;                    /* keep only part codes present in both */
  array invent(*) Miami--LA;   /* all city columns, in dataset order */
  do i = 1 to dim(invent);
    /* pick the column whose name matches the city from B */
    if vname(invent(i)) = City then do;
      stock = invent(i);
      output;
    end;
  end;
  keep Part_Code City stock;
run;
One other option: VVALUEX will look up a column's value based on its name.
(VVALUEX cannot be used in an SQL query though, if that matters.)
data tableA;
infile cards truncover;
input Part_code $ Miami LA;
cards;
A_1 60000 38000
A_2 5000 2000
A_3 1000 60000
;
data tableB;
infile cards truncover;
input Part_code $ City $;
cards;
A_1 Miami
;
run;
proc sort data=tableA;
by part_code;
run;
proc sort data=tableB;
by part_code;
run;
data want;
merge tableB (in=B) tableA (in=A);
by part_code;
if B;
Value=input(vvaluex(City), best32.);
keep part_code city value;
run;

complex db2/sql query with time-sampling, group, map, join and csv export

I have data in a table (named TESTING) in dashDB on IBM Bluemix (Db2 Warehouse on Cloud) which looks like this:
ID TIMESTAMP NAME VALUE
abc 2017-12-21 19:55:38.762 test1 123
abc 2017-12-21 19:55:42.762 test2 456
abc 2017-12-21 19:57:38.762 test1 789
abc 2017-12-21 19:58:38.762 test3 345
def 2017-12-21 19:59:38.762 test1 678
I am looking for a query that:
1. samples the data (for each NAME) to a given time grid (e.g. to a 1-minute-based timestamp);
2. averages VALUEs that fall into the same time range (the same minute), leaving empty times as NULL.
For 1. and 2. I have something like this (working for one NAME only):
with dummy(temporaer) as (
select TIMESTAMP('2017-12-01') from SYSIBM.SYSDUMMY1
union all
select temporaer + 1 MINUTES from dummy where temporaer < TIMESTAMP('2018-02-01')
)
select temporaer, avg(VALUE) as test1 from dummy
LEFT OUTER JOIN TESTING ON temporaer=date_trunc('minute', TIMESTAMP) and ID='abc' and NAME='test1'
group by temporaer
ORDER BY temporaer ASC;
3. joins all the different NAMEs column-wise into a matrix, like:
TIMESTAMP test1 test2 test3
2017-12-01 00:00:00 null null null
...
2017-12-21 19:55:00 123 456 null
2017-12-21 19:56:00 null null null
2017-12-21 19:57:00 789 null null
2017-12-21 19:58:00 678 null 345
...
2018-01-31 23:59:00 null null null
4. returns the query result as a CSV, or as a CSV string.
Does anybody know how this could be done in one query, or at least in a simple and fast way? Or is it necessary to store the data in another table format? If so, can you give me a hint?
Here is a code snippet that does the job, but takes a very long time:
WITH
-- get all distinct names in table:
header(names) AS (SELECT DISTINCT name
FROM FIELDTEST
WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$') AND DATE(TIMESTAMP)>='2017-12-19' AND DATE(TIMESTAMP)<'2017-12-24'),
-- select data (names, values without stringvalues) from table dedicated by timestamp to bigger timeinterval (here minutes):
dummie(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
FROM FIELDTEST
WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$')),
-- generate a range of times from date to date in defined steps:
dummy(time, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM dummy
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- add each name (from header) to each time/row (in dummy):
dumpy(time, names) AS (SELECT Dummy.time, Header.names
FROM Dummy
LEFT OUTER JOIN Header
ON Dummy.time IS NOT NULL),
-- averages values by name and timeinterval and sorts result to dummy:
dummj(time, names, avgvalues) AS (SELECT Dummy.time, Dummie.names, AVG(Dummie.values)
FROM Dummy
LEFT OUTER JOIN Dummie
ON Dummie.time = Dummy.time
GROUP BY Dummie.names, Dummy.time),
-- joins the averages (by time, name) values to the times and names in dumpy (on empty value use -9999):
testo(time, names, avgvalues) AS (SELECT Dumpy.time, Dumpy.names, COALESCE(Dummj.avgvalues,-9999)
FROM Dumpy
LEFT OUTER JOIN Dummj
ON Dummj.time = Dumpy.time AND Dummj.names = Dumpy.names),
-- converts the high amount of rows to less rows with delimited strings:
test(time, names, avgvalues) AS (SELECT time, LISTAGG(names,';') WITHIN GROUP(ORDER BY names), LISTAGG(avgvalues,';') WITHIN GROUP(ORDER BY names)
FROM Testo
GROUP BY time)
SELECT * FROM test ORDER BY time ASC, names ASC;
The performance problem is in the "testo" subquery. Does anybody have an idea what the failure is here, or know how to improve the query?
Well, one problem I see is that you keep using functions on columns, but that shouldn't be too big a drain if id is reasonably unique. If this query is very common, it may also be worth it to permanently build and index the range table. Hmm, you probably need several indices (starting with FieldTest.id), but you might also try this version:
-- let's name things properly, too, to keep them straight.
WITH
-- generate a range of times from date to date in defined steps:
Range (rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM Range
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- get all distinct names in table:
Header(names) AS (SELECT DISTINCT name
FROM FieldTest
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
-- just make the white space check part of the regex
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')
AND timestamp >= TIMESTAMP('2017-12-19')
AND timestamp < TIMESTAMP('2017-12-24')),
-- I'm assuming the (id, name) tuple is unique, which means we don't need to repeat the regex later
Data (rangeStart, name, averaged) AS (SELECT Range.rangeStart, Header.names, COALESCE(AVG(FieldTest.value), -9999)
                                      FROM Range
                                      CROSS JOIN Header
                                      LEFT JOIN FieldTest
                                             ON FieldTest.id = '7b9bbe44d45d8f2ac324849a4951da54'
                                            AND FieldTest.name = Header.names
                                            AND FieldTest.timestamp >= Range.rangeStart
                                            AND FieldTest.timestamp < Range.rangeEnd
                                      GROUP BY Range.rangeStart, Header.names)
-- I can't recall if DB2 allows using the new column name this way, you may need to wrap this again
SELECT rangeStart,
       -- converts the high amount of rows to fewer rows with delimited strings:
       LISTAGG(name, ';') WITHIN GROUP (ORDER BY name) AS names,
       LISTAGG(averaged, ';') WITHIN GROUP (ORDER BY name) AS avgvalues
FROM Data
GROUP BY rangeStart
ORDER BY rangeStart
(not tested)
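As for the indexing suggestion above, a hypothetical sketch (the index name is made up, and note that column-organized tables, the default in Db2 Warehouse on Cloud, do not accept user-created indexes, so this only applies to row-organized tables):
-- covers the id filter, the name equality, and the timestamp range
CREATE INDEX fieldtest_id_name_ts
    ON FIELDTEST (ID, NAME, TIMESTAMP);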
The CROSS JOIN was definitely a nice hint. I was not able to implement the LEFT JOIN exactly as you suggested, but I found a workaround which, I am sure, still leaves room for improvement, but is acceptable for me at the moment (a time saving of about a factor of 30 compared to my first query). Here is the actual code:
WITH
-- generate a range of times from date to date in defined steps:
TimeRange(rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM TimeRange
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- get all distinct names in table:
Header(names) AS (SELECT DISTINCT name
FROM FIELDTEST
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')
AND timestamp >= TIMESTAMP('2017-12-19')
AND timestamp < TIMESTAMP('2017-12-24')),
-- select data (names, values without stringvalues) from table dedicated by timestamp to bigger timeinterval (here minutes):
rawData(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
FROM FIELDTEST
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')),
-- I'm assuming the (id, name) tuple is unique, which means we don't need to repeat the regex later
Data(rangeStart, name, averaged) AS (SELECT TimeRange.rangeStart, Header.names, COALESCE(AVG(rawData.values), -9999)
FROM TimeRange
CROSS JOIN Header
LEFT JOIN rawData
ON rawData.names = Header.names
AND rawData.time = TimeRange.rangeStart
GROUP BY TimeRange.rangeStart, Header.names),
test(time, names, avgvalues) AS (SELECT Data.rangeStart,
LISTAGG(Data.name,';') WITHIN GROUP(ORDER BY name),
LISTAGG(Data.averaged,';') WITHIN GROUP(ORDER BY name)
FROM Data
GROUP BY Data.rangeStart)
-- build my own delimited export-string:
SELECT CONCAT(CONCAT(SUBSTR(REPLACE(time,'.',':'),1,19),';'), REPLACE(CAST(avgvalues AS VARCHAR(3980)),'-9999',''))
FROM test
UNION ALL
SELECT CONCAT(CAST('TIME;' AS VARCHAR(5)), CAST(LISTAGG(names,';') WITHIN GROUP(ORDER BY names) AS VARCHAR(3980)))
FROM Header;
