PSQL group by vs. aggregate speed - psql

So, the general question is: what's faster, taking an aggregate of a field or having extra expressions in the GROUP BY clause? Here are the two queries.
Query 1 (extra expressions in GROUP BY):
SELECT sum(subquery.what_i_want)
FROM (
SELECT table_1.some_id,
(
CASE WHEN some_date_field IS NOT NULL
THEN
FLOOR(((some_date_field - current_date)::numeric / 7) + 1) * MAX(some_other_integer)
ELSE
some_integer * MAX(some_other_integer)
END
) what_i_want
FROM table_1
JOIN table_2 on table_1.some_id = table_2.id
WHERE ((some_date_field IS NOT NULL AND some_date_field > current_date) OR some_integer > 0) -- per the data and what i want, one of these will always be true
GROUP BY table_1.some_id, some_date_field, some_integer
) subquery
Query 2 (using an aggregate function; the choice of aggregate is arbitrary because, in this dataset, every record has the same value for the table_2 fields in question):
SELECT sum(subquery.what_i_want)
FROM (
SELECT table_1.some_id,
(
CASE WHEN MAX(some_date_field) IS NOT NULL
THEN
FLOOR(((MAX(some_date_field) - current_date)::numeric / 7) + 1) * MAX(some_other_integer)
ELSE
MAX(some_integer) * MAX(some_other_integer)
END
) what_i_want
FROM table_1
JOIN table_2 on table_1.some_id = table_2.id
WHERE ((some_date_field IS NOT NULL AND some_date_field > current_date) OR some_integer > 0) -- per the data and what i want, one of these will always be true
GROUP BY table_1.some_id
) subquery
As far as I can tell, psql doesn't provide good benchmarking tools: \timing only reports the time of a single query, so running a benchmark with enough trials for meaningful results is... tedious at best.
For the record, I did do this at about n=50 and saw the aggregate method (Query 2) run faster on average, but with a p-value of ~.13, so not quite conclusive.
'sup with that?

The general answer: it should be about the same either way. There is a chance of hitting or missing a function-based index depending on whether you apply functions to a field, but that concerns expressions on columns rather than aggregate functions, and it matters more in the WHERE clause than in the column list. But this is speculation only.
What you should use for analyzing execution is EXPLAIN ANALYZE. In the plan you see not only the scan types, but also the number of iterations, the cost, and the timing of individual operations. And of course you can run it from psql.
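For example, a minimal sketch (prefix either of the queries above; BUFFERS is optional but often useful):
EXPLAIN (ANALYZE, BUFFERS)
SELECT sum(subquery.what_i_want)
FROM (
    ... -- body of Query 1 or Query 2 here
) subquery;
Compare the aggregate nodes (HashAggregate vs. GroupAggregate), their actual times, and their row counts between the two plans. For repeated timings, the pgbench tool that ships with PostgreSQL can replay an arbitrary query file (e.g. pgbench -n -f query.sql -t 50 mydb), which sidesteps the one-query-at-a-time limitation of \timing.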

Related

Postgres Join on Dynamic Subquery

Background
I have 2 data tables.
For each row in tableA, I want to find the rows in tableB with the closest dates and join those values onto the row from tableA.
Example tables:
tableA:
p_id  category  l_date
1     catA      2005-01-05
1     catB      2005-06-10
2     catC      2000-01-10
tableB:
p_id  e_id  e_date
1     22    2005-01-01
1     23    2005-01-06
1     24    2005-01-06
1     28    2005-01-10
2     29    2010-08-10
desired result:
p_id  category  l_date      e_id  e_date
1     catA      2005-01-05  23    2005-01-06
1     catA      2005-01-05  24    2005-01-06
1     catB      2005-06-10  28    2005-01-10
2     catC      2000-01-10  29    2010-08-10
Tried
This query does not work, but I think this is the direction I should be going.
select a.p_id, a.category, a.l_date, c.e_id, c.e_date from tableA a
left join lateral
(
select top 1 p_id, e_id, e_date from tableB b
where a.pid = b.pid
order by abs(datediff(days, a.l_date, b.e_date))
) c on True;
TableA and tableB are massive: 17M and 150M rows, respectively.
Does this sound like the correct approach?
Using redshift cluster, running postgres 8.x
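(For reference, on stock PostgreSQL the attempt above could be written as the sketch below; top 1 and datediff(days, ...) are SQL Server idioms. Note that limit 1 keeps only one row per tableA row, so the tie in the desired result, e_id 23 and 24, would need a rank-based variant instead, and Redshift does not support LATERAL anyway, which motivates the window-function approach in the answer below.)
select a.p_id, a.category, a.l_date, c.e_id, c.e_date
from tableA a
left join lateral (
    select b.e_id, b.e_date
    from tableB b
    where b.p_id = a.p_id
    order by abs(b.e_date - a.l_date)
    limit 1
) c on true;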
Correlated subquery approaches and full cross join approaches all perform the task of comparing every row in one table with every row in the other (in one manner or another). Comparing (joining) all these rows gets prohibitive when the tables get large. In these cases different approaches are needed.
Brute force won't be fast (if it even completes), so we need to be a bit more efficient in going about this. I tell clients to think about how they would do this query by hand if I gave them stacks of index cards. A person values their time, so they wouldn't go about it by making all possible combinations; they would come up with a more efficient way that they can complete quickly and get back to their lives. In cases like the one you are describing, you need to find that more efficient approach. I'd be happy to talk to you more about building these types of queries.
Taking your data (and sprucing it up a bit for some more interesting cases) I created an example of how you can do this. (Yes, you could cross join the small tables and do this with simpler SQL but that won't scale.)
Data setup:
create table tableA (p_id int, category varchar(64), l_date date);
insert into tableA values
(1,'catA','2005-01-05'),
(1,'catB','2005-06-10'),
(2,'catC','2000-01-10');
create table tableB (p_id int, e_id int, e_date date);
insert into tableB values
(1,22,'2005-01-01'),
(1,23,'2005-01-06'),
(1,24,'2005-06-01'),
(1,28,'2005-06-15'),
(2,29,'2010-08-10');
The query looks like:
with combined as
(
select
*,
coalesce(max(l_date) OVER (partition by p_id order by
dt rows between unbounded preceding and 1 preceding), '1970-01-01'::date) cb,
coalesce(min(l_date) OVER (partition by p_id order by
dt desc rows between unbounded preceding and 1 preceding), '2100-01-01'::date) ca
from
(
select
p_id,
category,
l_date,
NULL as e_id,
NULL as e_date,
l_date dt
from
tableA
union all
select
p_id,
NULL as category,
NULL as l_date,
e_id,
e_date,
e_date dt
from
tableB
) c
)
,
closest as
(
select
p_id,
e_id,
e_date,
cb,
ca,
case
when
coalesce(e_date - cb, 0) > (ca - e_date)
then ca
else cb
end closest
from
combined
where
e_date is not NULL
)
select
c.p_id,
a.category,
a.l_date,
c.e_id,
c.e_date
from
closest c
left join tableA a
on c.closest = a.l_date and c.p_id = a.p_id
order by
c.p_id,
c.e_id ;
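Run against the sample data above, this should produce (if I have traced the window frames correctly):
p_id  category  l_date      e_id  e_date
1     catA      2005-01-05  22    2005-01-01
1     catA      2005-01-05  23    2005-01-06
1     catB      2005-06-10  24    2005-06-01
1     catB      2005-06-10  28    2005-06-15
2     catC      2000-01-10  29    2010-08-10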
While this can look like a lot, it isn't that complex. The first CTE finds the closest l_date earlier than each e_date (cb) and the closest l_date later than it (ca). It does this on a UNIONed set of both tables so that the window functions can see the dates from each. The second CTE just determines which of cb and ca is closer and produces that as "closest"; it also strips out the tableA rows that the UNION added (no longer needed). Lastly, this "closest" date provides the join criteria needed to build the final result.
Now, this query doesn't account for many possible real-world data issues, so take it as a starting point. I'm also making some assumptions about your data based on the test data (such as no two rows in tableA having the same l_date and p_id).
And a last word on performance: while window functions are not cheap and will do more work as your tables grow, they are orders of magnitude more performant than cross joining massive tables. What you are looking to do is complex and will take some time, but this is the fastest way I have found to perform these operations that would otherwise be a massive looping problem.

Rails relation ordering?

So I want to translate this SQL query into Rails (and in this EXACT order). Suppose I have:
WITH sub_table as (
SELECT * FROM main_table ORDER BY id LIMIT 10 OFFSET 100
)
SELECT * FROM sub_table INNER JOIN other_table
ON sub_table.id = other_table.other_id
The importance here is that the order of execution must be:
LIMIT and OFFSET in that sub_table query MUST be executed first
The second statement should happen after.
So if the relations I have are called OtherTable and MainTable does something like this work:
subTableRelation = MainTable.order(id: :asc).limit(10).offset(100)
subTableRelation.join(OtherTable, ....)
The main question here is how Rails Relation execution order impacts things.
While ActiveRecord does not provide CTEs in its high level API, Arel will allow you to build this exact query.
Since you did not provide models and obfuscated the table names I will build this completely in Arel for the time being.
sub_table = Arel::Table.new('sub_table')
main_table = Arel::Table.new('main_table')
other_table = Arel::Table.new('other_table')
sub_table_query = main_table.project(Arel.star).take(10).skip(100).order(main_table[:id])
sub_table_alias = Arel::Nodes::As.new(Arel.sql(sub_table.name),sub_table_query)
query = sub_table.project(Arel.star)
.join(other_table).on(sub_table[:id].eq(other_table[:other_id]))
.with(sub_table_alias)
query.to_sql
Output:
WITH sub_table AS (
SELECT
*
FROM main_table
ORDER BY main_table.id
-- Output here will differ by database
LIMIT 10 OFFSET 100
)
SELECT
*
FROM sub_table
INNER JOIN other_table ON sub_table.id = other_table.other_id
If you are able to provide better context, I can provide a better solution, most likely resulting in an ActiveRecord::Relation object, which is likely to be preferable for chaining and model access purposes.
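As a side note: if the CTE itself isn't a hard requirement, the same execution order can be expressed with a plain derived table, which is easier to reach from ActiveRecord's .from. A sketch of the equivalent SQL (same assumed table names):
SELECT *
FROM (
    SELECT * FROM main_table ORDER BY id LIMIT 10 OFFSET 100
) sub_table
INNER JOIN other_table ON sub_table.id = other_table.other_id;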

proc sql inner join behavior and required select statements

I recently started using SAS, only receiving a basic training that didn't cover proc sql. I'd like to read up a bit more on SAS sql when I have the time.
For now, I found a solution to what I wanted to do, but I'm having difficulties understanding what is happening.
My issue started when I wanted to find out which subjects in my dataset have a certain value for all their records. I made use of my previously written snippet of code that I thought I understood. I just tried adding a couple more variables and group by statements:
data have;
input subject:$1. myvar:1. mycount:1.;
datalines;
a 1 1
a 0 2
a 0 3
b 1 1
b 0 2
b 1 3
c 1 1
c 1 2 /*This subject has myvar = 1 for all its observations*/
;
run;
*find subjects;
proc sql;
create table want as
/* select*/
/* distinct x.subject */
/* from */
(select distinct subject, count(myvar) as myvar_c
from have where myvar = 1 group by subject) x,
(select distinct subject, max(mycount) as max_c
from have group by subject) y
where x.subject = y.subject and x.myvar_c = y.max_c;
quit;
When the comment markers around 'select distinct x.subject from' are removed (so the create table statement actually selects from the two inline views), the above code works as it should.
However, I've previously also created another piece of code, to select all subjects in my dataset that have two types of records:
data have2;
input subject:$1. mytype:1.;
datalines;
a 1
a 0
a 0
b 1
b 0
b 1
c 1
c 1 /*This subject doesn't have two types of records in all its observations*/
;
run;
*Find subjects;
proc sql;
create table want2 as select
distinct x.subject from
have2 x,
(select distinct subject, count(distinct mytype) as mytype_c from have2 group by subject) y
where y.mytype_c = 2 and x.subject = y.subject;
quit;
This is similar, but didn't require the additional select statement: the first code has three select statements, while the second requires only two.
Can someone explain why the extra select is required in the first case?
Can someone also link me some good documentation that lists the specifications of these types of joins, and tell me the specific name of this type of join where you only use a comma?
While I'm writing this, I also see that I could've taken the code I initially wrote to find subjects that have only one type of record and tweaked it for my current issue >.< But I'd still like to know what is happening in the first example.
The SQL join construct
FROM ONE, TWO, THREE, …
is known as a CROSS JOIN and is a join without criteria. The comma (,) syntax is less prevalent today and the following construct is recommended
FROM ONE
CROSS JOIN TWO
CROSS JOIN THREE
The result set is a cartesian product and the number of rows is the product of the number of rows in the cross joined tables.
When the query has criteria (WHERE clause) the join is an INNER JOIN.
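For example, your first query could be written with explicit join syntax, as in this sketch (the distinct keywords are redundant once you group by subject):
proc sql;
  create table want as
  select x.subject
  from (select subject, count(myvar) as myvar_c
        from have where myvar = 1 group by subject) x
  inner join
       (select subject, max(mycount) as max_c
        from have group by subject) y
  on x.subject = y.subject and x.myvar_c = y.max_c;
quit;
The outer select is required either way: x and y are inline views, so the create table statement still needs a select list to project columns from them. Your second query needed only two selects because one of its join operands was the base table have2 rather than an inline view.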
The SAS documentation for Proc SQL is a good starting point and includes examples.
joined-table Component
Joins a table with itself or with other tables or views.
…
Its table of contents covers syntax, types of joins (inner, outer, cross, union, natural), joining tables, table aliases, joining a table with itself, joining more than two tables, specifying the rows to be returned, and a comparison of joins and subqueries.
General tip:
If you want to fool around (fiddle) with SQL queries in a browser, try the SQL Fiddle website.

Linq Query Timing Out

I have this query that uses the DBContext entities I created.
var referral = entities.StudentReferrals.Where(x => x.ReferralID == p && x.SchoolYear == year).FirstOrDefault();
When I remove x.SchoolYear == year the query works fine, but with it my query times out. This is the opposite of what I would expect: the more you narrow a query down via Where clause constraints, the less likely I'd expect it to be to time out.
SchoolYear is a field in the query and the query itself is valid; when I run the equivalent query in SQL Server Management Studio it returns results in less than a second.
My confusion is: why would adding a constraint to the Where clause cause a query to time out?
x.SchoolYear and year are both strings.
The full query is...
SELECT [Extent1].[BirthDate] AS [BirthDate],
[Extent1].[LegalFirstName] AS [LegalFirstName],
[Extent1].[LegalLastName] AS [LegalLastName],
[Extent1].[PreferredFirstName] AS [PreferredFirstName],
[Extent1].[PreferredLastName] AS [PreferredLastName],
[Extent1].[StudentNumber] AS [StudentNumber],
[Extent1].[LegacyStudentNumber] AS [LegacyStudentNumber],
[Extent1].[TranscriptSchoolCode] AS [TranscriptSchoolCode],
[Extent1].[OEN] AS [OEN],
[Extent1].[StatusIndicator] AS [StatusIndicator],
[Extent1].[SchoolYear] AS [SchoolYear],
[Extent1].[ReferralID] AS [ReferralID],
[Extent1].[PersonID] AS [PersonID],
[Extent1].[Active] AS [Active],
[Extent1].[ServiceTypeID] AS [ServiceTypeID],
[Extent1].[IsSchoolActive] AS [IsSchoolActive],
[Extent1].[Principal] AS [Principal],
[Extent1].[SchoolName] AS [SchoolName],
[Extent1].[SchoolCode] AS [SchoolCode],
[Extent1].[NearNorthSchoolCode] AS [NearNorthSchoolCode],
[Extent1].[TranscriptSchoolPrincipal] AS [TranscriptSchoolPrincipal],
[Extent1].[TranscriptSchoolName] AS [TranscriptSchoolName],
[Extent1].[TranscriptNearNorthSchoolCode] AS [TranscriptNearNorthSchoolCode],
[Extent1].[GuardianFirstName] AS [GuardianFirstName],
[Extent1].[GuardianLastName] AS [GuardianLastName],
[Extent1].[AreaCode] AS [AreaCode],
[Extent1].[ContactNo] AS [ContactNo],
[Extent1].[ReferredByFirstName] AS [ReferredByFirstName],
[Extent1].[ReferredByLastName] AS [ReferredByLastName],
[Extent1].[ReferredDate] AS [ReferredDate],
[Extent1].[Reason] AS [Reason],
[Extent1].[gender] AS [gender],
[Extent1].[grade] AS [grade],
[Extent1].[HomeroomTeacher] AS [HomeroomTeacher],
[Extent1].[IntakeTeamMember] AS [IntakeTeamMember],
[Extent1].[IntakeMemberID] AS [IntakeMemberID]
FROM (SELECT [StudentReferrals].[BirthDate] AS [BirthDate],
[StudentReferrals].[LegalFirstName] AS [LegalFirstName],
[StudentReferrals].[LegalLastName] AS [LegalLastName],
[StudentReferrals].[PreferredFirstName] AS [PreferredFirstName],
[StudentReferrals].[PreferredLastName] AS [PreferredLastName],
[StudentReferrals].[gender] AS [gender],
[StudentReferrals].[StudentNumber] AS [StudentNumber],
[StudentReferrals].[LegacyStudentNumber] AS [LegacyStudentNumber],
[StudentReferrals].[TranscriptSchoolCode] AS [TranscriptSchoolCode],
[StudentReferrals].[OEN] AS [OEN],
[StudentReferrals].[StatusIndicator] AS [StatusIndicator],
[StudentReferrals].[SchoolYear] AS [SchoolYear],
[StudentReferrals].[grade] AS [grade],
[StudentReferrals].[ReferralID] AS [ReferralID],
[StudentReferrals].[PersonID] AS [PersonID],
[StudentReferrals].[Active] AS [Active],
[StudentReferrals].[ServiceTypeID] AS [ServiceTypeID],
[StudentReferrals].[IsSchoolActive] AS [IsSchoolActive],
[StudentReferrals].[Principal] AS [Principal],
[StudentReferrals].[SchoolName] AS [SchoolName],
[StudentReferrals].[SchoolCode] AS [SchoolCode],
[StudentReferrals].[NearNorthSchoolCode] AS [NearNorthSchoolCode],
[StudentReferrals].[TranscriptSchoolPrincipal] AS [TranscriptSchoolPrincipal],
[StudentReferrals].[TranscriptSchoolName] AS [TranscriptSchoolName],
[StudentReferrals].[TranscriptNearNorthSchoolCode] AS [TranscriptNearNorthSchoolCode],
[StudentReferrals].[GuardianFirstName] AS [GuardianFirstName],
[StudentReferrals].[GuardianLastName] AS [GuardianLastName],
[StudentReferrals].[AreaCode] AS [AreaCode],
[StudentReferrals].[ContactNo] AS [ContactNo],
[StudentReferrals].[ReferredByFirstName] AS [ReferredByFirstName],
[StudentReferrals].[ReferredByLastName] AS [ReferredByLastName],
[StudentReferrals].[ReferredDate] AS [ReferredDate],
[StudentReferrals].[IntakeTeamMember] AS [IntakeTeamMember],
[StudentReferrals].[IntakeMemberID] AS [IntakeMemberID],
[StudentReferrals].[Reason] AS [Reason],
[StudentReferrals].[HomeroomTeacher] AS [HomeroomTeacher]
FROM [dbo].[StudentReferrals] AS [StudentReferrals]) AS [Extent1]
WHERE ([Extent1].[ReferralID] = @p__linq__0) AND ([Extent1].[SchoolYear] = @p__linq__1)
Here is the StudentReferral definition...
SELECT TOP (100) PERCENT p.person_id AS PersonID, p.birth_date AS BirthDate, p.legal_first_name AS LegalFirstName, p.legal_surname AS LegalLastName, p.preferred_first_name AS PreferredFirstName,
p.preferred_surname AS PreferredLastName, p.gender, p.student_no AS StudentNumber, p.legacy_student_number AS LegacyStudentNumber, p.transcript_school_code AS TranscriptSchoolCode,
p.oen_number AS OEN, s.status_indicator_code AS StatusIndicator, s.school_year AS SchoolYear, s.grade, CAST(CASE WHEN PATINDEX('%[^A-Za-z]%', s.Grade) = 0 THEN 1 ELSE CASE WHEN CAST(s.Grade AS int)
< 9 THEN 1 ELSE 0 END END AS bit) AS IsElementary, t.SchoolName, t.SchoolCode, t.NearNorthSchoolCode, pg.person_id AS GuardianID, pg.legal_first_name AS GuardianFirstName,
pg.legal_surname AS GuardianLastName, pt.area_code AS AreaCode, pt.phone_no AS ContactNo, pt.email_account AS Email
FROM Trillium.dbo.persons AS p INNER JOIN
Trillium.dbo.student_registrations AS s ON s.person_id = p.person_id INNER JOIN
dbo.Schools AS t ON t.SchoolCode = s.school_code INNER JOIN
NNDSB_AD_Routines.dbo.Students_Trillium_Guardians AS g ON s.person_id = g.student_person_id INNER JOIN
Trillium.dbo.persons AS pg ON g.contact_person_id = pg.person_id INNER JOIN
Trillium.dbo.person_telecom AS pt ON pg.person_id = pt.person_id
WHERE (s.status_indicator_code IN ('Active', 'PreReg')) AND (pt.telecom_type_name = 'home')
GROUP BY p.person_id, p.birth_date, p.legal_first_name, p.legal_surname, p.preferred_first_name, p.preferred_surname, p.gender, p.student_no, p.legacy_student_number, p.transcript_school_code, p.oen_number,
s.status_indicator_code, s.school_year, s.grade, CAST(CASE WHEN PATINDEX('%[^A-Za-z]%', s.Grade) = 0 THEN 1 ELSE CASE WHEN CAST(s.Grade AS int) < 9 THEN 1 ELSE 0 END END AS bit), t.SchoolName,
t.SchoolCode, t.NearNorthSchoolCode, pg.person_id, pg.legal_first_name, pg.legal_surname, pt.area_code, pt.phone_no, pt.email_account, g.primary_contact_priority
ORDER BY g.primary_contact_priority
I can almost guarantee that the query that EF produces and the query you're executing in SSMS are not the exact same SELECT statement. You probably wrote something like what Stephen Byrne has in his answer, i.e.
SELECT * from StudentReferrals WHERE ReferralID=1 AND SchoolYear='2015'
Right off the bat, this query doesn't have a TOP qualifier on it, which your EF query probably will due to the presence of the FirstOrDefault call.
Your first step should be to use something like SQL Profiler and grab the actual query that EF is generating. It's possible that with that query the optimizer is choosing to do a table scan because of the type of query that is being generated.
This likely won't make any difference, but you could also try rewriting your query as:
var referral = entities.StudentReferrals.FirstOrDefault(x => x.ReferralID == p && x.SchoolYear == year);
As an example, when I write the following query against my database:
OrganizationalNodes.FirstOrDefault(on => on.Name == "Justice League")
EF generates the following SQL:
SELECT
[Limit1].[C1] AS [C1],
[Limit1].[Id] AS [Id],
-- columns omitted for brevity
FROM ( SELECT TOP (1)
[Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name],
-- columns omitted for brevity
'0X0X' AS [C1]
FROM [dbo].[OrganizationalItems] AS [Extent1]
INNER JOIN [dbo].[OrganizationalNodes] AS [Extent2] ON [Extent1].[Id] = [Extent2].[Id]
WHERE N'Justice League' = [Extent1].[Name]
) AS [Limit1]
Well, to answer the question
why would adding a constraint to the Where clause cause a query to time out
The most likely cause is that you have a lot of data in the table, but no index covers the SchoolYear column. Therefore, when you include it in a WHERE clause, this causes a Table Scan (because every row has to be checked to see whether it should be included in the result set).
If you use SQL Server Management Studio and write the query manually, e.g.
SELECT * from StudentReferrals WHERE ReferralID=1 AND SchoolYear='2015'
and then include the actual execution plan (Query -> Include Actual Execution Plan), you will get the execution breakdown, which will show you clearly whether there is a Table Scan involved. If there is, create an index to "cover" the columns involved and it should fix your issue.
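A sketch only, since StudentReferrals looks like a view rather than a table: the index would belong on the underlying table, assuming (from the view definition above) that school_year on Trillium.dbo.student_registrations is the column that feeds SchoolYear:
CREATE NONCLUSTERED INDEX IX_student_registrations_school_year
    ON Trillium.dbo.student_registrations (school_year, person_id);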
Update
Another possible solution could be to run DBCC FREEPROCCACHE to clear out any cached execution plans, just in case SQL Server has for some reason picked something insane for the query generated by Entity Framework.
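If you want to try that, it is a one-liner, but note that without arguments it clears cached plans for the whole instance, so avoid running it casually on a busy production server:
DBCC FREEPROCCACHE;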

How to use joins and averages together in Hive queries

I have two tables in hive:
Table1: uid, txid, amt, vendor
Table2: uid, txid
Now I need to join the tables on txid, which basically confirms that a transaction was finally recorded. Some transactions will be present only in Table1 and not in Table2.
I need to find the average transaction match rate per user (uid) per vendor. Then I need the average of these averages: add up all the per-user averages and divide by the number of unique users per vendor.
Let's say I have the data:
Table1:
u1,120,44,vend1
u1,199,33,vend1
u1,100,23,vend1
u1,101,24,vend1
u2,200,34,vend1
u2,202,32,vend2
Table2:
u1,100
u1,101
u2,200
u2,202
Example for vendor vend1:
u1 -> avg transaction find rate = 2 (matches found in both Table1 and Table2) / 4 (total occurrences in Table1) = 0.5
u2 -> avg transaction find rate = 1 / 1 = 1
Avg of avgs = (0.5 + 1) (sum of avgs) / 2 (total unique users) = 0.75
For vend2, u2 is the only user and its single transaction matches, so the result is 1 / 1 = 1.
Required output:
vend1,0.75
vend2,1
I can't seem to get both the match count and the total occurrence count in Table1, per user per vendor, in one Hive query. I have gotten as far as this query and can't see how to change it further.
SELECT A.vendor,A.uid,count(*) as totalmatchesperuser FROM Table1 A JOIN Table2 B ON A.uid = B.uid AND B.txid =A.txid group by vendor,A.uid
Any help would be great.
I think you are running into trouble with your JOIN. When you JOIN by txid and uid, you lose the total number of uids per group. If I were you, I would add a column of 1's to table2 (named something like success or transaction) and do a LEFT OUTER JOIN. Then the joined table will have a column with a 1 where there was a completed transaction and NULL otherwise. You can then use a case statement to convert those NULLs to 0.
Query:
select vendor
,(SUM(avg_uid) / COUNT(uid)) as avg_of_avgs
from (
select vendor
,uid
,AVG(complete) as avg_uid
from (
select uid
,txid
,amt
,vendor
,case when success is null then 0
else success
end as complete
from (
select A.*
,B.success
from table1 as A
LEFT OUTER JOIN table2 as B
ON B.txid = A.txid
) x
) y
group by vendor, uid
) z
group by vendor
Output:
vend1 0.75
vend2 1.0
B.success in line 17 is the column of 1's that I put in table2 before the JOIN. If you are curious about case statements in Hive, the Hive documentation covers them.
Amazing and precise answer by GoBrewers14!! Thank you so much. I was looking at it from the wrong perspective.
I made small changes to the query to finally get things done.
I didn't need to add a "success" column to table2. I checked B.txid in the above query instead of B.success: B.txid will be NULL when a match is not found and have some value when a match is found, so it captures the success and failure conditions by itself without adding a new column. Then in the part above it I mapped NULL to 0 and non-NULL to 1. I also renamed some columns, as Hive was finding them ambiguous.
The final query looks like :
select vendr
,(SUM(avg_uid) / COUNT(usrid)) as avg_of_avgs
from (
select vendr
,usrid
,AVG(complete) as avg_uid
from (
select usrid
,txnid
,amnt
,vendr
,case when success is null then 0
else 1
end as complete
from (
select A.uid as usrid,A.vendor as vendr,A.amt as amnt,A.txid as txnid
,B.txid as success
from Table1 as A
LEFT OUTER JOIN Table2 as B
ON B.txid = A.txid
) x
) y
group by vendr, usrid
) z
group by vendr;
