LINQ to SQL Pagination and COUNT(*) - asp.net-mvc

I'm using the PagedList class in my web application that many of you might be familiar with if you have been doing anything with ASP.NET MVC and LINQ to SQL. It has been blogged about by Rob Conery, and a similar incarnation was included in things like Nerd Dinner, etc. It works great, but my DBA has raised concerns about potential future performance problems.
His issue is around the SELECT COUNT(*) that gets issued as a result of this line:
TotalCount = source.Count();
Any action that has paged data will fire off an additional query (like below) as a result of the IQueryable.Count() method call:
SELECT COUNT(*) AS [value] FROM [dbo].[Products] AS [t0]
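For reference, the pattern in question looks roughly like this (a from-memory sketch of the PagedList idea, not Rob's exact code; both queries come out of the constructor):

using System.Collections.Generic;
using System.Linq;

public class PagedList<T> : List<T>
{
    public int TotalCount { get; private set; }
    public int PageIndex { get; private set; }
    public int PageSize { get; private set; }

    public PagedList(IQueryable<T> source, int pageIndex, int pageSize)
    {
        TotalCount = source.Count();                // issues the SELECT COUNT(*)
        PageIndex = pageIndex;
        PageSize = pageSize;
        AddRange(source.Skip(pageIndex * pageSize)  // issues the paged SELECT
                       .Take(pageSize));
    }
}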
Is there a better way to handle this? I considered using the Count property of the PagedList class to get the item count, but realized that this won't work because it's only counting the number of items currently displayed (not the total count).
How much of a performance hit will this cause to my application when there's a lot of data in the database?

IIRC this stuff is part of the index stats and should be very efficient; you should ask your DBA to substantiate his concerns rather than prematurely optimising.

Actually, this is a pretty common issue with LINQ.
Yes, index stats will get used if the statement is only SELECT COUNT(*) AS [value] FROM [dbo].[Products] AS [t0], but 99% of the time it's going to contain a WHERE clause as well.
So basically two SQL statements are executed:
SELECT COUNT(*) AS [value] FROM [dbo].[Products] AS [t0] WHERE blah=blah and someint=500
SELECT blah, someint FROM [dbo].[Products] AS [t0] WHERE blah=blah and someint=500
You start running into problems if the table is updated often, because the COUNT(*) returned by the first statement may no longer match the rows returned by the second; this can surface as the error 'Row not found or changed.'

Some databases (Oracle, PostgreSQL, and I think SQL Server) keep a record of row counts in their system tables, though these are sometimes only accurate as of the last statistics refresh (as in Oracle). You could use this approach if you only need a fairly-accurate-but-not-exact figure.
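For example, on SQL Server you can read the approximate count from the index metadata instead of issuing COUNT(*). A minimal sketch, assuming a LINQ to SQL DataContext named db and the SQL Server 2000-era sysindexes view (newer versions expose the same data through sys.partitions):

using System.Linq;

// Cheap approximate row count from index metadata; it is only as fresh as
// the statistics, so treat it as fairly-accurate-but-not-exact.
int approxCount = db.ExecuteQuery<int>(
    @"SELECT rows FROM sysindexes
      WHERE id = OBJECT_ID('dbo.Products') AND indid < 2").Single();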
Which database are you using, or does that vary?

(PS: I know that you are talking about MS SQL, however.)
I am no DBA, but COUNT(*) in MySQL is a real performance hit. Simply changing this to COUNT(id) really does improve the speed.
I came across this when I was querying a table with very large BLOB (image) data. The query took around 15 seconds to load. Changing the query to COUNT(id) reduced it to 0.02 seconds. Still a little slow, but a hell of a lot better.
I think this is what the DBA is getting at. I have noticed that when debugging LINQ, the counting statement takes a very long time (1 second) to jump to the next statement.
Based on my findings, I have to agree with the DBA's concerns...

Related

How to efficiently fetch n most recent rows with GROUP BY in sqlite?

I have a table of event results, and I need to fetch the most recent n events per player for a given list of players.
This is on iOS, so it needs to be fast. I've looked at a lot of top-n-per-group solutions that use subqueries or joins, but these run slow for my 100k row dataset even on a MacBook Pro. So far my dumb solution, since I will only run this with a maximum of 6 players, is to do 6 separate queries. It isn't terribly slow, but there has to be a better way, right? Here's the gist of what I'm doing now:
results_by_pid = {}
player_ids = [1, 2, 3, 4, 5, 6]
n_results = 6
for pid in player_ids:
    # exec_sql is my helper; it runs a parameterized query and returns the rows
    results_by_pid[pid] = exec_sql("""SELECT *
                                      FROM results
                                      WHERE player_id = ?
                                      ORDER BY event_date DESC
                                      LIMIT ?""", (pid, n_results))
And then I go on my merry way. But how can I turn this into a single fast query?
There is no better way.
SQL window functions, which might help, are not implemented in SQLite.
SQLite is designed as an embedded database where most of the logic stays in the application.
In contrast to client/server databases where network communication should be avoided, there is no performance disadvantage to mixing SQL commands and program logic.
A less dumb solution requires you to do some SELECT player_id FROM somewhere beforehand, which should be no trouble.
To make the individual queries efficient, ensure you have one index on the two columns player_id and event_date.
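To make that concrete, a minimal sketch (shown in C# with Microsoft.Data.Sqlite purely for illustration; the SQL is exactly the same through the C API):

using Microsoft.Data.Sqlite;

// One composite index satisfies both the WHERE and the ORDER BY:
//   CREATE INDEX IF NOT EXISTS idx_player_date ON results(player_id, event_date);

using var conn = new SqliteConnection("Data Source=results.db;Mode=ReadOnly");
conn.Open();
using var cmd = conn.CreateCommand();
cmd.CommandText = @"SELECT * FROM results
                    WHERE player_id = $pid
                    ORDER BY event_date DESC
                    LIMIT $n";
cmd.Parameters.AddWithValue("$pid", 1);  // one query per player, as described above
cmd.Parameters.AddWithValue("$n", 6);    // the n most recent events
using var reader = cmd.ExecuteReader();  // walks the index backwards; no sort step needed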
This won't be much of an answer, but here goes...
I have found that making things really quick can involve ideas from the nature of the data and schema themselves. For example, searching an ordered list is faster than searching an unordered list, but you have to pay a cost up front - both in design and execution.
So ask yourself if there are any natural partitions on your data that may reduce the number of records SQLite must search. You might ask whether the latest n events fall within a particular time period. Will they all be from the last seven days? The last month? If so then you can construct the query to rule out whole chunks of data before performing more complex searches.
Also, if you just can't get the thing to work quickly, you can consider UX trickery! Soooooo many engineers don't get clever with their UX. Will your query be run as the result of a view controller push? Then set it going on a background thread from the PREVIOUS view controller, and let it work while iOS animates. How long does a push animation take? .2 seconds? At what point does your user indicate to the app (via some UX control) which player ids are going to be queried? As soon as he touches that button or table cell, you can prefetch some data. Even if the total work is fixed, you can often split it into pieces and hide the expensive part behind the animation.
Just some thoughts while I avoid doing my own hard work.
More thoughts
How about a separate table that contains the ids of the previous n inserts? You could add a trigger to delete old ids if the size of the table grows above n. Say..
CREATE TABLE IF NOT EXISTS recent_results
(result_id INTEGER PRIMARY KEY, event_date DATE);
-- SQLite has no strict DATE type (the column gets NUMERIC affinity), but you get the point

CREATE TRIGGER IF NOT EXISTS optimizer
AFTER INSERT ON recent_results
WHEN (SELECT COUNT(*) FROM recent_results) > N  -- replace N with your actual limit
BEGIN
DELETE FROM recent_results
WHERE result_id = (SELECT result_id
FROM recent_results
ORDER BY event_date
LIMIT 1);
END;
-- or something like that. I have no idea if this will work, I just threw it together.
Or you could just create a temporary memory-based table that you populate at app load and keep up to date as you perform transactions during app execution. That way you only pay the steep price once!
Just a few more thoughts for you. Be creative, and remember that you can usually define what you want as a data structure as well as an algorithm. Good luck!

SQL Server 2000 : want to use single stored procedure to return different types of SQL queries

I'm limited to working with SQL Server 2000 on a very big project. For one module I have to create 3 to 10 stored procedures. To make this manageable, I'm writing one stored procedure that returns different SQL queries based on a condition, like:
IF @QueryId = 'SelAllEmp'
    SELECT EmpId, EmpName FROM EMP
ELSE IF @QueryId = 'SelEmpById'
    SELECT EmpId, EmpName FROM EMP WHERE EmpId = @EmpId
ELSE IF @QueryId = 'EMPDept'
    SELECT EmpId, DeptId, DeptName FROM EMPDept
......................................
My question is, are there any hidden consequences or impacts using this technique?
I don't think the way you are approaching this is manageable at all. For the cases you've shown in the question, you should strive to make that a single query. Let the client decide whether or not they'll use the DeptName column - the client has the option to ignore it, and knows to do so because it had to pass the EmpDept argument. If your client can ignore that column, then your three queries can become one:
SELECT EmpId, EmpName, DeptName
FROM dbo.EMP
WHERE EmpId = CASE WHEN @QueryId = 'SelEmpById' THEN @EmpId ELSE EmpId END;
This query solves all three of your conditions. To avoid getting stuck with a bad plan, you can add OPTION (RECOMPILE) to the statement or WITH RECOMPILE to the procedure. Yes, this can cause overhead (not as bad as Joon makes it sound), but I'll take a little compilation every time over getting sucked into a horrible plan every other day. By default, SQL Server 2000 can't optimize all of your paths for a single stored procedure.
Another option is to build the query you need with dynamic SQL. This can cause plan cache bloat, but it shouldn't be too bad if all of the options are used frequently, and you can mitigate it with the optimize for ad hoc workloads server setting.
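For illustration, a hedged sketch of the dynamic-SQL idea from the client side with plain ADO.NET (Erland's articles below cover the server-side sp_executesql equivalent); queryId, empId, and the open connection conn are assumed:

using System.Data.SqlClient;

// Build only the predicates you need, but always through parameters -
// never by concatenating values into the SQL string.
string sql = "SELECT EmpId, EmpName FROM dbo.EMP WHERE 1 = 1";
using (var cmd = new SqlCommand())
{
    if (queryId == "SelEmpById")
    {
        sql += " AND EmpId = @EmpId";
        cmd.Parameters.AddWithValue("@EmpId", empId);
    }
    cmd.CommandText = sql;
    cmd.Connection = conn;
    using (var reader = cmd.ExecuteReader())
    {
        // consume the rows...
    }
}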
Two very valuable reads by Erland Sommarskog:
Dynamic Search Conditions in T-SQL (Version for SQL 2005 and Earlier)
The Curse and Blessings of Dynamic SQL
Basically, don't be afraid of dynamic SQL, but be aware of the potential issues.
Sorry, came back and edited since my answer was geared toward newer versions of SQL Server. It's hard to remember that people out there are still using SQL Server 2000 for some reason.
When a stored proc gets above a certain complexity, it will recompile whenever it is called from the client.
This places overhead on the server, and in busy apps can cause overall performance degradation if it happens enough.
That is one potential negative consequence of following this technique.
Also, the shape of your result set changes based on the input to your stored proc. That can break clients that expect a certain field to be present.

Performance of generated T-SQL from Entity Framework

I recently used Entity Framework for a project, despite my DBA's strong disapproval. So one day he came to my office complaining about the generated T-SQL reaching his database.
For instance, when I want to select a product based on the id, I write something like this:
context.Products.FirstOrDefault(p=>p.Id==id);
Which translates to
SELECT ... FROM (SELECT TOP 1 ... FROM PRODUCTS WHERE ID=@id)
So he is shouting, "Why on earth would you write a SELECT * FROM (SELECT TOP 1)"
So I changed my code to
context.Products.Where(p=>p.Id==id).ToList().FirstOrDefault()
and this produces a much cleaner T-SQL:
SELECT ... FROM PRODUCTS WHERE ID=@id
The inner query and the TOP 1 disappeared. Enough rambling; my question is this: does the first query really put an overhead on SQL Server? Is it harder to parse than the second method? The Id column has a clustered index on it. I want a good answer so I can rub it in his face (or mine).
Thanks,
Themos
Have you tried running the queries manually and comparing the execution plans?
The biggest problem here isn't that the SQL isn't perfectly formed to your DBA's standards (and I'm fairly certain that the query engine will optimize out the extra select). The second version is actually the worse one: the WHERE still runs in the database, but calling ToList() materializes every matching row so that the first element can be taken in memory, instead of letting the database stop after one row with TOP 1. Limiting the result set is definitely a task that should be performed by the DB and not the application layer.
In short, he's being a pedant; leave it the way it was.

sqlite optimize read performance

I'm using SQLite in an iPhone app for a read-only database.
One use case involves issuing a lot of select statements, each returning around 3 rows.
It's not possible to reduce the number of queries, because the parameters for the next query depend on the result of the previous query.
The query itself is quite simple:
SELECT int1, int2, int3, int4, int5, int6, int7 FROM sometable WHERE (int1 = ? AND int2 = ?) OR (int3 = ? AND int4 = ?) ORDER BY ROWID
The table has an index on (int1, int2) and an index on (int3, int4). All the int columns have datatype INTEGER.
The query is done via C-API. A statement is compiled with sqlite3_prepare_v2() and used for all queries. After each query, sqlite3_reset() on the statement is executed, before binding the new parameters.
The database file is opened with flags SQLITE_OPEN_READONLY and SQLITE_OPEN_NOMUTEX.
Profiling on the iPhone shows that a big part of the time is spent in sqlite3_step() -> sqlite3VdbeExec -> sqlite3BtreeBeginTrans -> sqlite3PagerSharedLock -> the pVfs->xAccess() line.
I'm no SQLite expert, but to me it looks like there's time wasted on unneeded locking.
Unneeded, because it's guaranteed that there's no other access to the database while these queries are run.
I also wonder about sqlite3BtreeBeginTrans. Are transactions created for select statements?
Can anyone tell me how to further optimize this?
The correct answer from the sqlite-users mailing list is to use EXCLUSIVE locking mode (PRAGMA locking_mode = EXCLUSIVE):
There are three reasons to set the locking-mode to EXCLUSIVE:
[...] 2) The number of system calls for filesystem operations is reduced, possibly resulting in a small performance increase.
The speedup was about 40%...
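A sketch of what that looks like (shown with Microsoft.Data.Sqlite for brevity; through the C API it's the same PRAGMA, issued once via sqlite3_exec() before the query loop):

using Microsoft.Data.Sqlite;

using var conn = new SqliteConnection("Data Source=app.db;Mode=ReadOnly");
conn.Open();
using (var pragma = conn.CreateCommand())
{
    pragma.CommandText = "PRAGMA locking_mode = EXCLUSIVE;";
    pragma.ExecuteNonQuery();
}
// The first read now takes the shared lock once and never releases it, so
// subsequent queries skip the per-statement lock/unlock file-system calls.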

Is a full list returned first and then filtered when using linq to sql to filter data from a database or just the filtered list?

This is probably a very simple question that I am working through in an MVC project. Here's an example of what I am talking about.
I have a dbml file linked to a database with a table called Users that has 500,000 rows. But I only want to find the Users who were entered on 5/7/2010. So let's say I do this in my UserRepository:
from u in db.GetUsers() where u.CreatedDate == new DateTime(2010, 5, 7) select u
(doing this from memory so don't kill me if my syntax is a little off, it's the concept I am looking for)
Does this statement first return all 500,000 rows and then filter it or does it only bring back the filtered list?
It filters in the database, since you're building your expression on top of an ITable, which gives you an IQueryable<T> data source.
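A quick contrast as a sketch (the date is just for illustration):

// Filtering the IQueryable: the WHERE clause ends up in the generated SQL.
var filteredInDb = db.GetUsers()
                     .Where(u => u.CreatedDate == new DateTime(2010, 5, 7));

// Calling ToList() first materializes all 500,000 rows; the filter
// then runs in your application's memory instead of the database.
var filteredInMemory = db.GetUsers()
                         .ToList()
                         .Where(u => u.CreatedDate == new DateTime(2010, 5, 7));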
Linq to SQL translates your query into SQL before sending it to the database, so only the filtered list is returned.
When the query is executed it will create SQL to return the filtered set only.
One thing to be aware of is that if you do nothing with the results of that query nothing will be queried at all.
The query will be deferred until you enumerate the result set.
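In code, the deferral looks like this (a sketch using the question's names):

var query = from u in db.GetUsers()
            where u.CreatedDate == new DateTime(2010, 5, 7)
            select u;           // nothing has been sent to the database yet

var users = query.ToList();     // the filtered SELECT executes here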
These folks are right, and one recommendation I would have is to monitor the queries that LinqToSql creates. LinqToSql is a great tool, but it's not perfect. I've noticed a number of little inefficiencies by monitoring the queries it creates and tweaking it a bit where needed.
The DataContext has a "Log" property that you can work with to view the queries created. I created a simple HttpModule that outputs the DataContext's Log (formatted for sweetness) to my output window. That way I can see the SQL it used and adjust if need be. It's been worth its weight in gold.
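The simplest version of that, before building anything as fancy as an HttpModule (a sketch, assuming db is your DataContext):

// DataContext.Log accepts any TextWriter; the console (or a debug writer)
// is the quickest way to see the exact T-SQL being sent.
db.Log = Console.Out;
var users = db.GetUsers()
              .Where(u => u.CreatedDate == new DateTime(2010, 5, 7))
              .ToList();        // the generated SELECT is logged before the rows come back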
Side note - I don't mean to be negative about the SQL that LinqToSql creates, as it's very good and efficient almost every time. Another good side effect of monitoring the queries is that you can show your friends who are die-hard ADO.NET / stored proc people how efficient LinqToSql really is.
