create new field based on multiple resident tables - join

Given multiple in-resident tables, I'd like to create a new field based on fields in different tables.
table1:
LOAD * INLINE [
id1,val1
a1,car1
a2,car2
];
table2:
LOAD * INLINE [
id2,id1,val2
b1,a1,type1
b2,a2,type2
];
table3:
LOAD * INLINE [
id3,id2,val3
c1,b1,mfr1
c2,b2,mfr2
];
For the sake of argument, assume table1 has ~1M rows, table2 ~1K rows, and table3 ~10 rows. I'd like to create a new field that is either added to table1 or perhaps in a new table linked by id1, resulting in:
id1 val1 newval
a1 car1 car1type1mfr1
a2 car2 car2type2mfr2
Efforts:
newtable:
load val1 & val2 & val3 as newval;
No errors but no newtable or newval.
newtable:
left join (table2)
load val1&val2 as newval resident table1;
Errs with Field not found - <val2>. (Obviously I want to extend this to include table3, but if I can't do it with 2 tables then 3 just won't work.)
The real data includes seven tables for this new field (lots of foreign keys). The data is being loaded from QVDs (the data is shared across multiple QVWs), closely mimicking a SQL database; none of the tables are row-wise redundant, so combining db tables into a single QVD table may be inefficient. (Plus, refreshing the data is far easier one table at a time.) A colleague suggested I load-join each of the QVDs into one huge table, but that doesn't seem right (nor have I successfully chain-joined even a few tables).
Using QV 12.0 desktop on win10-x64 for deployment on QVS.

@TheBudac's answer was part of the way there, but it only merged two of the three tables. Most of my problems stemmed from incorrect multi-table joins. My confusion was about the "join" syntax in Qlik; the docs make sense to me now that I see what's happening, but it wasn't obvious to me initially.
Here's what eventually worked best for me:
temptable:
load id1 as id1a, val1 as val1a
resident table1;
left join (temptable)
load id2 as id2a, id1 as id1a, val2 as val2a
resident table2;
left join (temptable)
load id2 as id2a, val3 as val3a
resident table3;
newtable:
load id1a as id1,
val1a & val2a & val3a as newval
resident temptable;
drop table temptable;
Quick walk-through:
Because I'm using left join, I start with the largest table; other joins would dictate different starting condition requirements. In my case, table1 was representing the largest, so I start with that:
temptable:
load id1 as id1a, val1 as val1a
resident table1;
Each join should be against the temporary table we're working on. Renaming fields is important so that Qlik doesn't create unnecessary synthetic keys.
left join (temptable)
load id2 as id2a, id1 as id1a, val2 as val2a
resident table2;
The use of resident is important: it reads from the table already loaded in memory, so it does not re-query (SQL) or re-load (QVD or other file) the source.
Repeat with the third and further tables, always joining against temptable with the new table.
Now we use that temporary table to create our new table. You can choose to augment table1 with this data instead (certainly feasible), but for me since I'm generating several new calculated fields (not shown here), it made sense to keep them logically separated.
newtable:
load id1a as id1,
val1a & val2a & val3a as newval
resident temptable;
drop table temptable;
Note that I rename the relevant key back to its original name so that this table correctly links to table1. Dropping the temporary table helps clean things up, but it does no harm to keep it around (and doing so helps in debugging/learning).
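For readers who don't have QlikView to hand, the same chain of left joins can be sketched in Python. This is only an illustration of the logic, not Qlik's implementation: the `left_join` helper is hypothetical, the tables are plain lists of dicts, and the sample values follow the expected result table in the question (car2 for a2).

```python
# Hypothetical in-memory copies of the three resident tables.
table1 = [{"id1": "a1", "val1": "car1"}, {"id1": "a2", "val1": "car2"}]
table2 = [
    {"id2": "b1", "id1": "a1", "val2": "type1"},
    {"id2": "b2", "id1": "a2", "val2": "type2"},
]
table3 = [
    {"id3": "c1", "id2": "b1", "val3": "mfr1"},
    {"id3": "c2", "id2": "b2", "val3": "mfr2"},
]


def left_join(left, right, key):
    """Left join two lists of dicts on one shared key; unmatched left rows are kept."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        matches = index.get(row.get(key))
        if not matches:
            joined.append(dict(row))
            continue
        for match in matches:
            joined.append({**row, **match})
    return joined


# Start from the largest table and keep joining onto the same working table.
temp = left_join(table1, table2, "id1")
temp = left_join(temp, table3, "id2")

# Only after all joins are done can the concatenated field be computed.
newtable = [
    {"id1": r["id1"], "newval": r.get("val1", "") + r.get("val2", "") + r.get("val3", "")}
    for r in temp
]
print(newtable)
```

As in the Qlik script, the concatenation is a separate pass over the joined working table, because val2 and val3 only exist on a row once the joins have run.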

Your join is the wrong way round, and QlikView can only work with the results after they have been joined, not during the join, so you will have to do another resident load to get the values concatenated into NewVal. The drop table commands are important, or you will end up with massive unintentional synthetic-key tables.
// The join adds table2's fields to table1; it does not create a new table
left join (table1)
load * resident table2;
drop table table2;
Resulttable:
load id1,
val1 & val2 as NewVal
resident table1;
drop table table1;

Related

proc sql inner join behavior and required select statements

I recently started using SAS, only receiving a basic training that didn't cover proc sql. I'd like to read up a bit more on SAS sql when I have the time.
For now, I found a solution to what I wanted to do, but I'm having difficulties understanding what is happening.
My issue started when I wanted to find out which subjects in my dataset have a certain value for all their records. I made use of my previously written snippet of code that I thought I understood. I just tried adding a couple more variables and group by statements:
data have;
input subject:$1. myvar:1. mycount:1.;
datalines;
a 1 1
a 0 2
a 0 3
b 1 1
b 0 2
b 1 3
c 1 1
c 1 2 /*This subject has myvar = 1 for all its observations*/
;
run;
*find subjects;
proc sql;
create table want as
/* select*/
/* distinct x.subject */
/* from */
(select distinct subject, count(myvar) as myvar_c
from have where myvar = 1 group by subject) x,
(select distinct subject, max(mycount) as max_c
from have group by subject) y
where x.subject = y.subject and x.myvar_c = y.max_c;
quit;
When I uncomment the 'select distinct x.subject from' lines in the create table statement, the above code works as it should.
However, I've previously also created another piece of code, to select all subjects in my dataset that have two types of records:
data have2;
input subject:$1. mytype:1.;
datalines;
a 1
a 0
a 0
b 1
b 0
b 1
c 1
c 1 /*This subject doesn't have two types of records in all its observations*/
;
run;
*Find subjects;
proc sql;
create table want2 as select
distinct x.subject from
have2 x,
(select distinct subject, count(distinct mytype) as mytype_c from have2 group by subject) y
where y.mytype_c = 2 and x.subject = y.subject;
quit;
Which is similar, but didn't require the additional select statement. The first code has 3 select statements, the second code only requires two select statements.
Can someone inform me why this is exactly required?
Or link me some good documentation that lists the specifications of these types of joins - can anyone also inform me of the specific name of this type of join where you only use a comma?
While writing this, I also see that I could have taken the code I initially wrote to find subjects that have only 1 type of record and tweaked it for my current issue >.< but I would still like to know what is happening in the first example.
The SQL join construct
FROM ONE, TWO, THREE, …
is known as a CROSS JOIN and is a join without criteria. The comma (,) syntax is less prevalent today and the following construct is recommended
FROM ONE
CROSS JOIN TWO
CROSS JOIN THREE
The result set is a cartesian product and the number of rows is the product of the number of rows in the cross joined tables.
When the query has join criteria in a WHERE clause (for example x.subject = y.subject), the result is equivalent to an INNER JOIN on those criteria.
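To see this equivalence outside SAS, here is a small sketch using Python's built-in sqlite3 module with the data from the question. SQLite is only standing in for PROC SQL here, so treat it as an illustration of the SQL semantics rather than of SAS itself. It also shows why the first query needs an outer SELECT: the two parenthesised subqueries only act as tables in the FROM clause, so something still has to say what to select from them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE have (subject TEXT, myvar INTEGER, mycount INTEGER);
    INSERT INTO have VALUES
        ('a', 1, 1), ('a', 0, 2), ('a', 0, 3),
        ('b', 1, 1), ('b', 0, 2), ('b', 1, 3),
        ('c', 1, 1), ('c', 1, 2);
    """
)

# Comma syntax: a cross join of the two subqueries, filtered by WHERE.
comma_sql = """
SELECT DISTINCT x.subject
FROM (SELECT subject, COUNT(myvar) AS myvar_c
      FROM have WHERE myvar = 1 GROUP BY subject) x,
     (SELECT subject, MAX(mycount) AS max_c
      FROM have GROUP BY subject) y
WHERE x.subject = y.subject AND x.myvar_c = y.max_c
ORDER BY x.subject
"""

# Explicit syntax: the join criterion moves into ON.
inner_sql = """
SELECT DISTINCT x.subject
FROM (SELECT subject, COUNT(myvar) AS myvar_c
      FROM have WHERE myvar = 1 GROUP BY subject) x
INNER JOIN
     (SELECT subject, MAX(mycount) AS max_c
      FROM have GROUP BY subject) y
  ON x.subject = y.subject
WHERE x.myvar_c = y.max_c
ORDER BY x.subject
"""

comma_rows = conn.execute(comma_sql).fetchall()
inner_rows = conn.execute(inner_sql).fetchall()
print(comma_rows, inner_rows)
```

Both queries return only subject c, the one subject whose count of myvar = 1 records equals its highest mycount.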
The SAS documentation for Proc SQL is a good starting point and includes examples.
joined-table Component
Joins a table with itself or with other tables or views.
…
Table of Contents
Syntax
Required Arguments
Optional Argument
Details
Types of Joins
Joining Tables
Table Limit
Specifying the Rows to Be Returned
Table Aliases
Joining a Table with Itself
Inner Joins
Outer Joins
Cross Joins
Union Joins
Natural Joins
Joining More Than Two Tables
Comparison of Joins and Subqueries
General tip:
If you want to fool around (fiddle) with SQL queries in a browser, try the SQL Fiddle web site.

Joining two tables based on matching two columns

I'm trying to join two tables:
Table A has three columns: State, County, and Count (of Farmer's Markets in said county)
Table B has several columns: State, County, and several data columns (like food access score)
I'm trying to combine them in such a way as to put the Count for each State/County combination (since there are multiple counties with the same name) together with the State and County and data columns from Table B.
I've been banging my head on SAS, trying to get a join to cooperate. I read a few other questions on here, but I can't find where the mistake is in my code.
PROC SQL;
CREATE TABLE WORK.QUERY1
AS
SELECT FMDV4.State, FMDV4.County, FMDV4.Count, CFSDV1.GROC14,
CFSDV1.SUPERC14, CFSDV1.CONVS14, CFSDV1.SPECS14, CFSDV1.FOODINSEC_13_15,
CFSDV1.PCT_LACCESS_POP15, CFSDV1.DIRSALES_FARMS12, CFSDV1.FMRKT16,
CFSDV1.FOODHUB16, CFSDV1.CSA12, CFSDV1.POVRATE15, CFSDV1.PERPOV10
FROM FNLPRJT.CFSDV1 AS CFSDV1
INNER JOIN FNLPRJT.FMDV4 AS FMDV4
ON (( CFSDV1.State = FMDV4.State ) AND ( CFSDV1.County =
FMDV4.County ));
QUIT;
I also tried a few variants, like:
PROC SQL;
CREATE TABLE WORK.QUERY1
AS
SELECT FMDV4.State, FMDV4.County, FMDV4.Count, CFSDV1.GROC14,
CFSDV1.SUPERC14, CFSDV1.CONVS14, CFSDV1.SPECS14, CFSDV1.FOODINSEC_13_15,
CFSDV1.PCT_LACCESS_POP15, CFSDV1.DIRSALES_FARMS12, CFSDV1.FMRKT16,
CFSDV1.FOODHUB16, CFSDV1.CSA12, CFSDV1.POVRATE15, CFSDV1.PERPOV10
FROM FNLPRJT.CFSDV1 AS CFSDV1
INNER JOIN FNLPRJT.FMDV4 AS FMDV4
ON CFSDV1.State = FMDV4.State
WHERE CFSDV1.County = FMDV4.County;
QUIT;
I get a table of 0 rows with the columns as they should be (State, County, Count, and so on). I'm just missing the dang data! Can anyone please help me find my mistake?
Can you try
propcase(CFSDV1.State) = propcase(FMDV4.State)
and
propcase(CFSDV1.County) = propcase(FMDV4.County);
If this doesn't work try character functions like trim and compress to remove any blanks that might be present in the data.
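The idea behind this suggestion can be sketched in Python: a join on text keys returns nothing if the two tables differ in case or stray blanks, and normalising both sides first brings the matches back. The `norm` and `inner_join` helpers, the `Score` column and all row values are made up for illustration. `norm` is also not the same function as SAS's propcase; it is just a comparable normalisation.

```python
def norm(value):
    """Collapse whitespace and ignore case before comparing key values."""
    return " ".join(value.split()).casefold()


# Made-up rows: same counties, but different case and a trailing blank.
markets = [
    {"State": "TEXAS", "County": "Harris ", "Count": 12},
    {"State": "Ohio", "County": "franklin", "Count": 7},
]
food = [
    {"State": "Texas", "County": "Harris", "Score": 1},
    {"State": "Ohio", "County": "Franklin", "Score": 2},
]


def inner_join(left, right, key_fn):
    """Inner join two lists of dicts using key_fn(row) as the join key."""
    index = {key_fn(row): row for row in right}
    return [
        {**index[key_fn(row)], **row} for row in left if key_fn(row) in index
    ]


strict_rows = inner_join(markets, food, lambda r: (r["State"], r["County"]))
loose_rows = inner_join(
    markets, food, lambda r: (norm(r["State"]), norm(r["County"]))
)
print(len(strict_rows), len(loose_rows))
```

If the strict join is empty and the normalised one is not, the keys differ only in formatting, which is what the propcase/trim/compress suggestions are probing for.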

Qlikview - join two table

I need to join two tables in QlikView to get a combined result table.
Any idea? Can I use CrossTable, and how?
For Table1 you can use the CrossTable functionality to "rotate" the table while keeping the first column.
For example:
CrossTable(Location, Quantity)
Load
Reason,
LocA,
LocB
From
[Data.xlsx] (ooxml, embedded labels, table is Table1)
;
The result table after this will be:
Location Reason Quantity
LocA R1 5
LocA R2 4
LocA R3 5
LocA R4 3
LocB R1 2
LocB R2 2
LocB R3 3
LocB R4 5
(you can learn more about CrossTable at Qlik's help site - CrossTable)
After having Table1 in this format you can create composite key (as x3ja suggested). Composite key is basically two (or more) fields concatenated. In your case the join between the tables should be on two fields - Location and Reason.
// CrossTable the data to get it in correct format
Table1_Temp:
CrossTable(Location, Quantity)
Load
Reason,
LocA,
LocB
From
[Data.xlsx] (ooxml, embedded labels, table is Table1)
;
// Resident load to form the composite key
// based on Location and Reason fields
Table1:
Load
Location & '|' & Reason as Key,
Quantity
Resident
Table1_Temp
;
// We don't need the Table1_Temp table anymore
Drop Table Table1_Temp;
//Load the second table and create the same composite key
Table2:
Load
Location & '|' & Reason as Key,
Location,
Reason,
Answer
From
[Data.xlsx] (ooxml, embedded labels, table is Table2)
;
After the reload, Table1 and Table2 are associated through the Key field.
Notice that the values for Answer, Location and Reason are null for two of the keys. This is because Table2 (based on your screenshots) doesn't contain the combinations LocB/R2 and LocA/R4, but Table1 does.
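The unpivot and composite-key steps can be sketched in Python to make the mechanics visible. The quantities are the ones from the CrossTable result above. The contents of Table2 are not shown in the text, so the keys and answers below are hypothetical placeholders that only respect the statement that LocB/R2 and LocA/R4 are missing.

```python
# Table1 in its original "wide" layout: one column per location.
wide = [
    {"Reason": "R1", "LocA": 5, "LocB": 2},
    {"Reason": "R2", "LocA": 4, "LocB": 2},
    {"Reason": "R3", "LocA": 5, "LocB": 3},
    {"Reason": "R4", "LocA": 3, "LocB": 5},
]

# Unpivot (what CrossTable does) and build the composite key in one pass.
table1 = {}
for row in wide:
    for location in ("LocA", "LocB"):
        key = f"{location}|{row['Reason']}"
        table1[key] = row[location]

# Hypothetical Table2: every combination except LocB|R2 and LocA|R4.
table2 = {
    key: "answer for " + key
    for key in table1
    if key not in ("LocB|R2", "LocA|R4")
}

# Keys present in Table1 with no partner in Table2 show up with null Answer.
unmatched = sorted(key for key in table1 if key not in table2)
print(unmatched)
```

Using a separator such as '|' in the key avoids accidental collisions between different Location and Reason pairs that would concatenate to the same string.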
If you want to keep only the combinations that are present in both tables then the approach is similar but with two differences:
Table2 should be loaded first
use the Keep prefix to exclude the non-common records from being loaded into Table1
(keep at Qlik's help site - keep)
If you want to see the script in action just comment the first tab and uncomment the second one in the example qvw
There are a couple of ways you could do this.
1. Using association. Load Table 1 twice and concatenate, creating a composite key. So you'd end up with fields ReasonLocation and Quantity. Then load Table 2 creating the same composite key, giving you ReasonLocation, Location, Reason & Answer. Then the tables would associate on that composite key.
2. Using a join. Load Table1, left join in Table 1 based on Reason with an if statement like if [Location] = 'LocA' then [LocA] else [LocB]. That may need you to load it into a temp table first and do the if statement in a resident load.
You could also combine the two and join the tables in #1 based on the ReasonLocation field.
Hope that helps - sorry it's not fully worked through...

Hive Bucketed Map Join

I am facing an issue executing a bucketed map join.
I am using Hive 0.10.
Table1 is a partitioned table on year,month and day. Each partition data is bucketed by column c1 into 128 buckets. I have almost 100 million records per day.
Table 1
create table table1
(
....
....
)
partitioned by (year int,month int,day int)
CLUSTERED BY(c1) INTO 128 BUCKETS;
Table2 is a large lookup table bucketed on column c1. I have 80 million records loaded into 128 buckets.
Table 2
create table table2
(
c1
c2
...
)
CLUSTERED BY(c1) INTO 128 BUCKETS;
I have checked the data and it's loaded as per expectation into buckets.
Now I am trying to enforce a bucketed map join. That's where I am stuck.
set hive.auto.convert.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.mapjoin.bucket.cache.size=1000000;
select a.c1 as c1_tb2,a.c2,
b.c1,b....
from table2 a
JOIN table1 b
ON (a.c1=b.c1);
I am still not getting a bucketed map join. Am I missing something? I even tried to execute the join on only 1 partition, but I still get the same result.
Or does bucketed map join not work with partitioned tables?
Please help. Thanks.
This explanation is for Hive 0.13. AFAICT, bucketed map join doesn't take effect for auto converted map joins. You will need to explicitly call out map join in the syntax like this:
set hive.optimize.bucketmapjoin = true;
explain extended select /*+ MAPJOIN(b) */ count(*)
from nation_b1 a
join nation_b2 b on (a.n_regionkey = b.n_regionkey);
Note that only explain extended shows you the flag that indicates if bucket map join is being used or not. Look for this line in the plan.
BucketMapJoin: true
Tables are bucketed in Hive so that portions of the data can be managed and processed individually, which makes processing easier to manage and more efficient.
Let's look at how the join works when the data is stored in buckets:
Let's say there are two tables, user and user_visits, and both are bucketed on user_id into 4 buckets. That means bucket 1 of user contains rows with the same user ids as bucket 1 of user_visits. If these two tables are joined on the user_id column and bucket 1 of both tables can be sent to the same mapper, a good amount of optimization can be achieved. This is exactly what a bucketed map join does.
Prerequisites for bucket map join:
Tables being joined are bucketized on the join columns,
The number of buckets in one table is the same as, or a multiple of, the number of buckets in the other table.
If the tables being joined are bucketized on the join columns, the buckets can be joined with each other. If table A has 4 buckets and table B has 4 buckets, the following join
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key
can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched. For the query above, the mapper processing bucket 1 for A will only fetch bucket 1 of B. It is not the default behavior, and is governed by the following parameter
set hive.optimize.bucketmapjoin = true
If the tables being joined are sorted and bucketized on the join columns, and they have the same number of buckets, a sort-merge join can be performed. The corresponding buckets are joined with each other at the mapper. If both A and B have 4 buckets,
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM A a JOIN B b ON a.key = b.key
can be done on the mapper only. The mapper for the bucket for A will traverse the corresponding bucket for B. This is not the default behavior, and the following parameters need to be set:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
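The bucket map join idea described above can be sketched in plain Python. This is a toy model, not Hive's implementation: it assumes integer keys, uses key modulo the bucket count in place of Hive's hash function, and the table contents and function names are hypothetical.

```python
NUM_BUCKETS = 4


def bucketize(rows):
    """Distribute (key, value) rows into buckets by key modulo NUM_BUCKETS."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, value in rows:
        buckets[key % NUM_BUCKETS].append((key, value))
    return buckets


def bucket_map_join(big_buckets, small_buckets):
    """Join bucket i of one table only against bucket i of the other."""
    out = []
    for big, small in zip(big_buckets, small_buckets):
        # Each "mapper" builds a lookup from just one bucket of the small side.
        lookup = {}
        for key, name in small:
            lookup.setdefault(key, []).append(name)
        for key, url in big:
            for name in lookup.get(key, []):
                out.append((key, name, url))
    return out


users = [(1, "ann"), (2, "bob"), (5, "cat"), (8, "dan")]
visits = [(1, "/home"), (5, "/cart"), (5, "/pay"), (7, "/faq"), (8, "/home")]

bucketed = sorted(bucket_map_join(bucketize(visits), bucketize(users)))

# Reference result: a naive join that compares every row with every row.
naive = sorted(
    (vk, name, url) for vk, url in visits for uk, name in users if vk == uk
)
print(bucketed == naive)
```

Because both tables are bucketed on the join key with the same bucket count, matching keys always land in the same bucket number, so each mapper only needs one bucket of the other table instead of the whole table.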

Change Data Capture with table joins in ETL

In my ETL process I am using Change Data Capture (CDC) to discover only the rows that have changed in the source tables since the last extraction. Then I do the transformation only for these rows. The problem arises when I have, for example, 2 tables which I want to join into one dimension, and only one of them has changed. For example, I have the tables Countries and Towns as follows:
Countries:
ID Name
1 France
Towns:
ID Name Country_ID
1 Lyon 1
Now let's say a new row is added to the Towns table:
ID Name Country_ID
1 Lyon 1
2 Paris 2
The Countries table has not been changed, so CDC for these tables shows me only the row from Towns table. The problem is when I do the join between Countries and Towns, there is no row in Countries change set, so the join will result in empty set.
Do you have an idea how to solve it? Of course there might be more difficult cases, involving 3 and more tables, and consequential joins.
This is a typical problem found when doing real-time Change Data Capture, or even incremental-only daily changes.
There are multiple ways to solve this.
One way would be to do your joins on the natural keys in the dimension or mapping table, to get the associated country (SELECT distinct country_name, [..other attributes..] from dim_table where country_id = X).
Another alternative would be to do the join as part of the change capture process - when a row is loaded to towns, a trigger goes off that loads the foreign key values into the associated staging tables (country, etc).
There is a lot more I could say on this, but I will be specific to what is in your question. I would suggest the following to get the results:
1st pass: everything that matches via the join
UNION ALL
2nd pass: all towns where there isn't a country
(left outer join with a WHERE condition that
requires the ID in the Countries table to be null/missing).
In that unmatched join you would default the Country ID to a designated "unmatched" value. Typically 0 or -1 is used, or a series of standard negative numbers to which you can assign descriptions later to identify why the data is bad. In your example, -1 could be "Found Town Without Country".
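A minimal sketch of this two-pass query, using Python's built-in sqlite3 module as a stand-in for the real ETL database. The table names `countries` and `towns_changes` and the extra changed row for Nice are assumptions added so that both passes return something. The Towns change set is joined against the full Countries table rather than against the Countries change set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Full Countries table (unchanged since the last extraction).
    CREATE TABLE countries (id INTEGER, name TEXT);
    INSERT INTO countries VALUES (1, 'France');

    -- Change set for Towns: Paris is from the question, Nice is hypothetical.
    CREATE TABLE towns_changes (id INTEGER, name TEXT, country_id INTEGER);
    INSERT INTO towns_changes VALUES (2, 'Paris', 2), (3, 'Nice', 1);
    """
)

two_pass_sql = """
-- 1st pass: changed towns that match a country
SELECT t.id AS town_id, t.name AS town, c.id AS country_id, c.name AS country
FROM towns_changes t
JOIN countries c ON c.id = t.country_id
UNION ALL
-- 2nd pass: changed towns with no country, defaulted to -1
SELECT t.id, t.name, -1, 'Found Town Without Country'
FROM towns_changes t
LEFT JOIN countries c ON c.id = t.country_id
WHERE c.id IS NULL
ORDER BY 1
"""

rows = conn.execute(two_pass_sql).fetchall()
print(rows)
```

With the data as given in the question, Paris points at Country_ID 2, which does not exist in Countries, so it falls into the second pass and receives the -1 placeholder.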

Resources