F# query expression: How to do a left join and return on those records where the joined table is null? - f#

I am using SQLProvider in a project and I would like to run a query with a left join and return all records that are missing from the joined table.
I suspect the answer to this question will involve one or both of the packages FSharp.Data.TypeProviders and FSharpComposableQuery, although, to be honest, I can't tell where one ends and the other begins.
The common example of a left join in the above links is given as
query {
for student in db.Student do
leftOuterJoin selection in db.CourseSelection
on (student.StudentID = selection.StudentID) into result
for selection in result.DefaultIfEmpty() do
select (student, selection)
}
And from what I can tell, this is equivalent to the sql:
select *
from Student s
left outer join CourseSelection cs on s.StudentID = cs.StudentID
But what I am looking for is the F# equivalent of the sql:
select *
from Student s
left outer join CourseSelection cs on s.StudentID = cs.StudentID
where s.StudentID is null
I realize that I can just return all records and then filter them in F#, but I want the filtering to happen on the database side where things are indexed and because, in my case especially, the number of non null records is huge, and I am only interested in the null ones.

I think this should do the trick:
query {
for student in db.Student do
leftOuterJoin selection in db.CourseSelection
on (student.StudentID = selection.StudentID) into result
where (not (result.Any()))
select student
}
or a nested query:
query {
for student in db.Student do
where (query {
for selection in db.CourseSelection do
all (student.StudentID <> selection.StudentID)
})
select student
}
Edit: since you're using FSharp.Data.TypeProviders, if you have a foreign key between these two tables then you should also have a property that gives the associated CourseSelections, something like this:
query {
for student in db.Student do
where (not (student.CourseSelections.Any()))
select student
}

Related

Join two datasets based on a flag and id

I am trying to join two datasets based on a flag and id.
i.e
proc sql;
create table demo as
select a.*,b.b1,b.2
from table1 a
left join table2 on
(a.flag=b.flag and a.id=b.id) or (a.flag ne b.flag and a.id=b.id)
end;
This code runs into a loop and never produces a output.
I want to make sure that where there are flag values matching get the attributes; if not get the attributes at id level so that we do not have blank values.
This join condition cannot be optimized. It is not a good practice to use or in a join. If you check your log, you'll see this:
NOTE: The execution of this query involves performing one or more Cartesian product joins
that can not be optimized.
Instead, transform your query to do a union:
proc sql;
create table demo as
select a.*,
b.b1,
b.b2
from table1 as a
left join
table2 as b
on a.flag=b.flag and a.id=b.id
UNION
select a.*,
b.b1,
b.b2
from table1 as a
left join
table2 as b
on a.flag ne b.flag and a.id=b.id
;
quit;

Optimizing SQL query using JOIN instead of NOT IN

I have a sql query that I'd like to optimize. I'm not the designer of the database, so I have no way of altering structure, indexes or stored procedures.
I have a table that consists of invoices (called faktura) and each invoice has a unique invoice id. If we have to cancel the invoice a secondary invoice is created in the same table but with a field ("modpartfakturaid") referring to the original invoice id.
Example of faktura table:
invoice 1: Id=152549, modpartfakturaid=null
invoice 2: Id=152592, modpartfakturaid=152549
We also have a table called "BHLFORLINIE" which consists of services rendered to the customer. Some of the services have already been invoiced and match a record in the invoice (FAKTURA) table.
What I'd like to do is get a list of all services that either does not have an invoice yet or does not have an invoice that's been cancelled.
What I'm doing now is this:
`SELECT
dbo.BHLFORLINIE.LeveringsDato AS treatmentDate,
dbo.PatientView.Navn AS patientName,
dbo.PatientView.CPRNR AS patientCPR
FROM
dbo.BHLFORLINIE
INNER JOIN dbo.BHLFORLOEB
ON dbo.BHLFORLOEB.BhlForloebID = dbo.BHLFORLINIE.BhlForloebID
INNER JOIN dbo.PatientView
ON dbo.PatientView.PersonID = dbo.BHLFORLOEB.PersonID
INNER JOIN dbo.HENVISNING
ON dbo.HENVISNING.BhlForloebID = dbo.BHLFORLOEB.BhlForloebID
LEFT JOIN dbo.FAKTURA
ON dbo.BHLFORLINIE.FakturaId = FAKTURA.FakturaId
WHERE
(dbo.BHLFORLINIE.LeveringsDato >= '2017-01-01' OR dbo.BHLFORLINIE.FakturaId IS NULL) AND
dbo.BHLFORLINIE.ProduktNr IN (110,111,112,113,8050,4001,4002,4003,4004,4005,4006,4007,4008,4009,6001,6002,6003,6004,6005,6006,6007,6008,7001,7002,7003,7004,7005,7006,7007,7008) AND
((dbo.FAKTURA.FakturaType = 0 AND
dbo.FAKTURA.FakturaID NOT IN (
SELECT FAKTURA.ModpartFakturaID FROM FAKTURA WHERE FAKTURA.ModpartFakturaID IS NOT NULL
)) OR
dbo.FAKTURA.FakturaType IS NULL)
GROUP BY
dbo.PatientView.CPRNR,
dbo.PatientView.Navn,
dbo.BHLFORLINIE.LeveringsDato`
Is there a smarter way of doing this? Right now the added the query performs three times slower because of the "not in" subquery.
Any help is much appreciated!
Peter
You can use an outer join and check for null values to find non matches
SELECT customer.name, invoice.id
FROM invoices i
INNER JOIN customer ON i.customerId = customer.customerId
LEFT OUTER JOIN invoices i2 ON i.invoiceId = i2.cancelInvoiceId
WHERE i2.invoiceId IS NULL

Unusual Joins SQL

I am having to convert code written by a former employee to work in a new database. In doing so I came across some joins I have never seen and do not fully understand how they work or if there is a need for them to be done in this fashion.
The joins look like this:
From Table A
Join(Table B
Join Table C
on B.Field1 = C.Field1)
On A.Field1 = B.Field1
Does this code function differently from something like this:
From Table A
Join Table B
On A.Field1 = B.Field1
Join Table C
On B.Field1 = C.Field1
If there is a difference please explain the purpose of the first set of code.
All of this is done in SQL Server 2012. Thanks in advance for any help you can provide.
I could create a temp table and then join that. But why use up the cycles\RAM on additional storage and indexes if I can just do it on the fly?
I ran across this scenario today in SSRS - a user wanted to see all the Individuals granted access through an AD group. The user was using a cursor and some temp tables to get the users out of AD and then joining the user to each SSRS object (Folders, reports, linked reports) associated with the AD group. I simplified the whole thing with Cross Apply and a sub query.
GroupMembers table
GroupName
UserID
UserName
AccountType
AccountTypeDesc
SSRSOjbects_Permissions table
Path
PathType
RoleName
RoleDesc
Name (AD group name)
The query needs to return each individual in an AD group associated with each report. Basically a Cartesian product of users to reports within a subset of data. The easiest way to do this looks like this:
select
G.GroupName, G.UserID, G.Name, G.AccountType, G.AccountTypeDesc,
[Path], PathType, RoleName, RoleDesc
from
GroupMembers G
cross apply
(select
[Path], PathType, RoleName, RoleDesc
from
SSRSOjbects_Permissions
where
Name = G.GroupName) S;
You could achieve this with a temp table and some outer joins, but why waste system resources?
I saw this kind of joins - it's MS Access style for handling multi-table joins. In MS Access you need to nest each subsequent join statement into its level brackets. So, for example this T-SQL join:
SELECT a.columna, b.columnb, c.columnc
FROM tablea AS a
LEFT JOIN tableb AS b ON a.id = b.id
LEFT JOIN tablec AS c ON a.id = c.id
you should convert to this:
SELECT a.columna, b.columnb, c.columnc
FROM ((tablea AS a) LEFT JOIN tableb AS b ON a.id = b.id) LEFT JOIN tablec AS c ON a.id = c.id
So, yes, I believe you are right in your assumption

Emulating an interval join in hive

I am using hive 0.13.
I have two tables:
data table. columns: id, time. 1E10 rows.
mymap table. columns: id, name, start_time, end_time. 1E6 rows.
For each row in the data table I want to get the name from the mymap table matching the id and the time interval. So I want to do a join like:
select data.id, time, name from data left outer join mymap on data.id = mymap.id and time>=start_time and time<end_time
It is known that for every row in data there are 0 or 1 matches in mymap.
The above query is not supported in hive as it is a non-equi-join. Moving the inequality conditions into a where filter does not work cause the join explodes before the filter is applied:
select data.id, time, name from data left outer join mymap on data.id = mymap.id where mymap.id is null or (time>=start_time and time<end_time)
(I am aware that the queries are not exactly equivalent due to cases where there is a match for id but no matching interval. This can be solved as I describe here: Hive: work around for non equi left join)
How can I go about this?
You could perform your join and then query from that table. I didn't test this code, but it would read something like
select id
,time
,name
from (
select d.id
,d.time
,m.name
,m.start_time
,m.end_time
from data as d LEFT OUTER JOIN mymap as m
ON d.id = m.id
) x
where time>=start_time
AND time<end_time
You could potentially get around this issue by flattening out the data structure in table2 and using a UDF to process the joined records.
select
id,
time,
nameFinderUDF(b.name_list, time) as name
from
data a
LEFT OUTER JOIN
(
select
id,
collect_set(array(name,cast(start_time as string),cast(end_time as string))) as name_list
from
mymap
group by
id
) b
ON (a.id=b.id)
With a UDF that does something like:
public String evaluate(ArrayList<ArrayList<String>> name_list,Long time) {
for (int i;i<name_list.length;i++) {
if (time >= Long.parseLong(name_list[i][1]) && time <= Long.parseLong(name_list[i][2])) {
return name_list[i][0]
return null;
}
This approach should make the merge 1 to 1, but it could create a fairly large data structure repeated many times. It is still quite a bit more efficient than a straight join.

ASP.NET MVC & EF4 Entity Framework - Are there any performance concerns in using the entities vs retrieving only the fields i need?

Lets say we have 3 tables, Users, Products, Purchases.
There is a view that needs to display the purchases made by a user.
I could lookup the data required by doing:
from p in DBSet<Purchases>.Include("User").Include("Product") select p;
However, I am concern that this may have a performance impact because it will retrieve the full objects.
Alternatively, I could select only the fields i need:
from p in DBSet<Purchases>.Include("User").Include("Product") select new SimplePurchaseInfo() { UserName = p.User.name, Userid = p.User.Id, ProductName = p.Product.Name ... etc };
So my question is:
Whats the best practice in doing this?
== EDIT
Thanks for all the replies.
[QUESTION 1]: I want to know whether all views should work with flat ViewModels with very specific data for that view, or should the ViewModels contain the entity objects.
Real example: User reviews Products
var query = from dr in productRepository.FindAllReviews()
where dr.User.UserId = 'userid'
select dr;
string sql = ((ObjectQuery)query).ToTraceString();
SELECT [Extent1].[ProductId] AS [ProductId],
[Extent1].[Comment] AS [Comment],
[Extent1].[CreatedTime] AS [CreatedTime],
[Extent1].[Id] AS [Id],
[Extent1].[Rating] AS [Rating],
[Extent1].[UserId] AS [UserId],
[Extent3].[CreatedTime] AS [CreatedTime1],
[Extent3].[CreatorId] AS [CreatorId],
[Extent3].[Description] AS [Description],
[Extent3].[Id] AS [Id1],
[Extent3].[Name] AS [Name],
[Extent3].[Price] AS [Price],
[Extent3].[Rating] AS [Rating1],
[Extent3].[ShopId] AS [ShopId],
[Extent3].[Thumbnail] AS [Thumbnail],
[Extent3].[Creator_UserId] AS [Creator_UserId],
[Extent4].[Comment] AS [Comment1],
[Extent4].[DateCreated] AS [DateCreated],
[Extent4].[DateLastActivity] AS [DateLastActivity],
[Extent4].[DateLastLogin] AS [DateLastLogin],
[Extent4].[DateLastPasswordChange] AS [DateLastPasswordChange],
[Extent4].[Email] AS [Email],
[Extent4].[Enabled] AS [Enabled],
[Extent4].[PasswordHash] AS [PasswordHash],
[Extent4].[PasswordSalt] AS [PasswordSalt],
[Extent4].[ScreenName] AS [ScreenName],
[Extent4].[Thumbnail] AS [Thumbnail1],
[Extent4].[UserId] AS [UserId1],
[Extent4].[UserName] AS [UserName]
FROM [ProductReviews] AS [Extent1]
INNER JOIN [Users] AS [Extent2] ON [Extent1].[UserId] = [Extent2].[UserId]
LEFT OUTER JOIN [Products] AS [Extent3] ON [Extent1].[ProductId] = [Extent3].[Id]
LEFT OUTER JOIN [Users] AS [Extent4] ON [Extent1].[UserId] = [Extent4].[UserId]
WHERE N'615005822' = [Extent2].[UserId]
or
from d in productRepository.FindAllProducts()
from dr in d.ProductReviews
where dr.User.UserId == 'userid'
orderby dr.CreatedTime
select new ProductReviewInfo()
{
product = new SimpleProductInfo() { Id = d.Id, Name = d.Name, Thumbnail = d.Thumbnail, Rating = d.Rating },
Rating = dr.Rating,
Comment = dr.Comment,
UserId = dr.UserId,
UserScreenName = dr.User.ScreenName,
UserThumbnail = dr.User.Thumbnail,
CreateTime = dr.CreatedTime
};
SELECT
[Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name],
[Extent1].[Thumbnail] AS [Thumbnail],
[Extent1].[Rating] AS [Rating],
[Extent2].[Rating] AS [Rating1],
[Extent2].[Comment] AS [Comment],
[Extent2].[UserId] AS [UserId],
[Extent4].[ScreenName] AS [ScreenName],
[Extent4].[Thumbnail] AS [Thumbnail1],
[Extent2].[CreatedTime] AS [CreatedTime]
FROM [Products] AS [Extent1]
INNER JOIN [ProductReviews] AS [Extent2] ON [Extent1].[Id] = [Extent2].[ProductId]
INNER JOIN [Users] AS [Extent3] ON [Extent2].[UserId] = [Extent3].[UserId]
LEFT OUTER JOIN [Users] AS [Extent4] ON [Extent2].[UserId] = [Extent4].[UserId]
WHERE N'userid' = [Extent3].[UserId]
ORDER BY [Extent2].[CreatedTime] ASC
[QUESTION 2]: Whats with the ugly outer joins?
In general, only retrieve what you need, but keep in mind to retrieve enough information so your application is not too chatty, so if you can batch a bunch of things together, do so, otherwise you'll pay network traffic cost everytime you need to go back to the database and retrieve some more stuffs.
In this case, assuming you will only need those info, I would go with the second approach (if that's what you really need).
Eager loading with .Include doesn't really play nice when you want filtering (or ordering for that matter).
That first query is basically this:
select p.*, u.*, p2.*
from products p
left outer join users u on p.userid = u.userid
left outer join purchases p2 on p.productid = p2.productid
where u.userid == #p1
Is that really what you want?
There is a view that needs to display the purchases made by a user.
Well then why are you including "Product"?
Shouldn't it just be:
from p in DBSet<Purchases>.Include("User") select p;
Your second query will error. You must project to an entity on the model, or an anonymous type - not a random class/DTO.
To be honest, the easiest and most well performing option in your current scenario is to query on the FK itself:
var purchasesForUser = DBSet<Purchases>.Where(x => x.UserId == userId);
That should produce:
select p.*
from products p
where p.UserId == #p1
The above query of course requires you to include the foreign keys in the model.
If you don't have the FK's in your model, then you'll need more LINQ-Entities trickery in the form of anonymous type projection.
Overall, don't go out looking to optimize. Create queries which align with the scenario/business requirement, then optimize if necessary - or look for alternatives to LINQ-Entities, such as stored procedures, views or compiled queries.
Remember: premature optimization is the root of all evil.
*EDIT - In response to Question Update *
[QUESTION 1]: I want to know whether all views should work with flat ViewModels with very specific data for that view, or should the ViewModels contain the entity objects.
Yes - ViewModel's should only contain what is required for that View. Otherwise why have the ViewModel? You may as well bind straight to the EF model. So, setup the ViewModel which only the fields it needs for the view.
[QUESTION 2]: What's with the ugly outer joins?
That is default behaviour for .Include. .Include always produces a left outer join.
I think the second query will throw exception because you can't map result to unmapped .NET type in Linq-to-entities. You have to return annonymous type and map it to your object in Linq-to-objects or you have to use some advanced concepts for projections - QueryView (projections in ESQL) or DefiningQuery (custom SQL query mapped to new readonly entity).
Generally it is more about design of your entities. If you select single small entity it is not a big difference to load it all instead of projection. If you are selecting list of entities you should consider projections - expecially if tables contains columns like nvarchar(max) or varbinar(max) which are not needed in your result!
Both create almost the same query: select from one table, with two inner joins. The only thing that changes from a database perspective is the amount of fields returned, but that shouldn't really matter that much.
I think here DRY wins from a performance hit (if it even exists): so my call is go for the first option.

Resources