sqlite optimize read performance - ios

I'm using SQLite in an iPhone app with a read-only database.
One use case involves issuing a lot of select statements, each returning around 3 rows.
It's not possible to reduce the number of queries, because the parameters for the next query depend on the result of the previous query.
The query itself is quite simple:
SELECT int1, int2, int3, int4, int5, int6, int7 FROM sometable WHERE (int1 = ? AND int2 = ?) OR (int3 = ? AND int4 = ?) ORDER BY ROWID
The table has an index on (int1, int2) and an index on (int3, int4). All columns have datatype INTEGER.
The queries are done via the C API. A statement is compiled with sqlite3_prepare_v2() and reused for all queries. After each query, sqlite3_reset() is called on the statement before binding the new parameters.
The database file is opened with flags SQLITE_OPEN_READONLY and SQLITE_OPEN_NOMUTEX.
Profiling on the iPhone shows that a big part of the time is spent in sqlite3_step() -> sqlite3VdbeExec -> sqlite3BtreeBeginTrans -> sqlite3PagerSharedLock -> pVfs->xAccess().
I'm no SQLite expert, but to me it looks like time is wasted on unneeded locking.
Unneeded, because it's guaranteed that there's no other access to the database while these queries are running.
I also wonder about sqlite3BtreeBeginTrans. Are transactions created for SELECT statements?
Can anyone tell me how to further optimize this?

The correct answer from the sqlite-users mailing list is to use EXCLUSIVE locking mode:
There are three reasons to set the locking-mode to EXCLUSIVE:
[...] 2) The number of system calls for filesystem operations is reduced, possibly resulting in a small performance increase.
Speedup about 40%...
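For reference, a minimal sketch of enabling this mode (names as in the question; the PRAGMA must be issued on the same connection before the queries run). It also answers the transaction question above: every statement runs inside at least an implicit read transaction, which is why the pager acquires a shared lock per query in the default NORMAL mode.

PRAGMA locking_mode = EXCLUSIVE;
-- The first read after this acquires the shared lock and never releases it,
-- so subsequent SELECTs skip the per-query pVfs->xAccess() locking work.
SELECT int1, int2, int3, int4, int5, int6, int7
FROM sometable
WHERE (int1 = ? AND int2 = ?) OR (int3 = ? AND int4 = ?)
ORDER BY ROWID;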

Related

SQLite: How many parameters can there be in an 'in' clause

I'd like to perform the following:
delete from images where image_address not in (<a long list>)
How long can this list be? (I'm guessing I might have to think of another way).
If you are using parameters (?), the maximum number is 999 by default.
If you are creating the SQL statement dynamically by inserting the values directly (which is a bad thing to do for strings), there is no upper limit on the length of such a list. However, there is a limit on the length of the entire SQL statement, which is one million bytes by default.
If you cannot guarantee that your query does not exceed these limits, you must use a temporary table (see LS_dev's answer).
If you have a long list, I would suggest two approaches:
First solution:
Add all data to temporary table:
CREATE TEMP TABLE lng_list(image_address);
-- Insert all your elements into the lng_list table
-- ...
DELETE FROM images WHERE image_address NOT IN (SELECT image_address FROM lng_list);
Make sure to run this inside a transaction to get good performance.
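A sketch of the whole sequence wrapped in a transaction (the column type and the inserted values are illustrative):

BEGIN;
CREATE TEMP TABLE lng_list(image_address TEXT);
INSERT INTO lng_list VALUES ('img001.png');
INSERT INTO lng_list VALUES ('img002.png');
-- ... one INSERT per element, all inside the same transaction ...
DELETE FROM images WHERE image_address NOT IN (SELECT image_address FROM lng_list);
COMMIT;
DROP TABLE lng_list;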
Second solution:
(REMOVED: only works for IN, not NOT IN...)
Performance should be fairly good with either of these solutions.

Is it faster to constantly assign a value or compare

I am scanning an SQLite database looking for all matches and using
OneFound := False;
tbl1.First;
while not tbl1.Eof do
begin
  if tbl1.FieldByName('Name').AsString = 'jones' then
    OneFound := True;
  tbl1.Next;
end;
if OneFound then // Do something
or should I be using
if not(OneFound) then OneFound:=True;
Is it faster to just assign True to OneFound no matter how many times it is assigned, or should I do the comparison and only change OneFound the first time?
I know a better way would be to use FTS3, but for now I have to scan the database, and the question is more about the approach: setting OneFound every time a match is encountered, versus comparing first and setting it just once.
Thanks
Your question is, which is faster:
if not(OneFound) then OneFound:=True;
or
OneFound := True;
The answer is probably that the second is faster. Conditional statements involve branches, which risk branch mis-prediction.
However, that line of code is trivial compared to what is around it. Running across a database one row at a time is going to be outrageously expensive. I bet that you will not be able to measure the difference between the two options because the handling of that little Boolean is simply swamped by the rest of the code. In which case choose the more readable and simpler version.
But if you care about the performance of this code you should be asking the database to do the work, as you yourself state. Write a query to perform the work.
It would be better to change your SQL statement so that the work is done in the database. If you want to know whether there is a tuple which contains the value 'jones' in the field 'name', then a quicker query would be
with tquery.create(nil) do
begin
  sql.add('select name from tbl1 where name = :p1 limit 1');
  params[0].asstring := 'jones';
  open;
  onefound := not isempty;
  close;
  free;
end;
Your syntax may vary regarding the 'limit' clause but the idea is to return only one tuple from the database which matches the 'where' statement - it doesn't matter which one.
I used a parameter to avoid problems delimiting the value.
1. Search one field
If you want to search one particular field content, using an INDEX and a SELECT will be the fastest.
SELECT * FROM MYTABLE WHERE NAME='Jones';
Do not forget to create an INDEX on the column, first!
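For example (the index name is illustrative):

CREATE INDEX IDX_MYTABLE_NAME ON MYTABLE(NAME);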
2. Fast reading
But if you want to search within a field, or within several fields, you may have to read and check the whole content. In this case, what will be slow is calling FieldByName() for each data row: you are better off using a local TField variable.
Or forget about TDataSet and switch to direct access to SQLite3. In fact, using DB.pas and TDataSet requires a lot of data marshalling, so it is slower than direct access.
See e.g. DiSQLite3 or our DB classes, which are very fast, but a bit higher level. Or you can use our ORM on top of those classes. Our classes are able to read more than 500,000 rows per second from a SQLite3 database, including JSON marshalling into object fields.
3. FTS3/FTS4
But, as you guessed, the fastest would indeed be to use the FTS3/FTS4 feature of SQLite3.
You can think of FTS3/FTS4 as a "meta-index" or a "full-text index" on a supplied blob of text. Just like Google is able to find a word in millions of web pages: it does not use a regular database, but full-text indexing.
In short, you create a virtual FTS3/FTS4 table in your database, then you insert the whole text of your main records into the TEXT field of this FTS table, forcing the ID field to be the one of the original data row.
Then, you will query for some words on your FTS3/FTS4 table, which will give you the matching IDs, much faster than a regular scan.
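A minimal sketch, assuming the main table is tbl1 with an INTEGER PRIMARY KEY column named id (names are illustrative):

CREATE VIRTUAL TABLE tbl1_fts USING fts4(name);
-- Copy the text of the main records, keeping the original row IDs.
INSERT INTO tbl1_fts(rowid, name) SELECT id, name FROM tbl1;
-- MATCH uses the full-text index instead of scanning every row.
SELECT rowid FROM tbl1_fts WHERE name MATCH 'jones';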
Note that our ORM has dedicated TSQLRecordFTS3 / TSQLRecordFTS4 kind of classes for direct FTS process.

How to efficiently fetch n most recent rows with GROUP BY in sqlite?

I have a table of event results, and I need to fetch the most recent n events per player for a given list of players.
This is on iOS so it needs to be fast. I've looked at a lot of top-n-per-group solutions that use subqueries or joins, but these run slow for my 100k-row dataset even on a MacBook Pro. So far my dumb solution, since I will only run this with a maximum of 6 players, is to do 6 separate queries. It isn't terribly slow, but there has to be a better way, right? Here's the gist of what I'm doing now:
results_by_pid = {}
player_ids = [1, 2, 3, 4, 5, 6]
n_results = 6
for pid in player_ids:
    results_by_pid[pid] = exec_sql("SELECT *
                                    FROM results
                                    WHERE player_id = #{pid}
                                    ORDER BY event_date DESC
                                    LIMIT #{n_results}")
And then I go on my merry way. But how can I turn this into a single fast query?
There is no better way.
SQL window functions, which might help, are not implemented in SQLite.
SQLite is designed as an embedded database where most of the logic stays in the application.
In contrast to client/server databases where network communication should be avoided, there is no performance disadvantage to mixing SQL commands and program logic.
A less dumb solution requires you to do some SELECT player_id FROM somewhere beforehand, which should be no trouble.
To make the individual queries efficient, ensure you have one index on the two columns player_id and event_date.
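For example (the index name is illustrative):

CREATE INDEX idx_results_player_date ON results(player_id, event_date);
-- Each per-player query can then walk this index backwards and satisfy
-- ORDER BY event_date DESC LIMIT n without sorting.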
This won't be much of an answer, but here goes...
I have found that making things really quick can involve ideas from the nature of the data and schema themselves. For example, searching an ordered list is faster than searching an unordered list, but you have to pay a cost up front - both in design and execution.
So ask yourself if there are any natural partitions on your data that may reduce the number of records SQLite must search. You might ask whether the latest n events fall within a particular time period. Will they all be from the last seven days? The last month? If so then you can construct the query to rule out whole chunks of data before performing more complex searches.
Also, if you just can't get the thing to work quickly, you can consider UX trickery! Soooooo many engineers don't get clever with their UX. Will your query be run as the result of a view controller push? Then set the thing going in a background thread from the PREVIOUS view controller, and let it work while iOS animates. How long does a push animation take? .2 seconds? At what point does your user indicate to the app (via some UX control) which playerids are going to be queried? As soon as he touches that button or TVCell, you can prefetch some data. So if the total work you have to do is O(n log n), that means you can probably break it up into O(n) and O(log n) pieces.
Just some thoughts while I avoid doing my own hard work.
More thoughts
How about a separate table that contains the ids of the previous n inserts? You could add a trigger to delete old ids if the size of the table grows above n. Say..
CREATE TABLE IF NOT EXISTS recent_results
(result_id INTEGER PRIMARY KEY, event_date DATE);
-- is DATE a type? I don't know. you get the point

CREATE TRIGGER IF NOT EXISTS optimizer
AFTER INSERT ON recent_results
WHEN (SELECT COUNT(*) FROM recent_results) > N
BEGIN
  DELETE FROM recent_results
  WHERE result_id = (SELECT result_id
                     FROM recent_results
                     ORDER BY event_date
                     LIMIT 1);
END;
-- or something like that. I have no idea if this will work,
-- I just threw it together.
Or you could just create a temporary memory-based table that you populate at app load and keep up to date as you perform transactions during app execution. That way you only pay the steep price once!
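A rough sketch of that idea (all names and the row cap are illustrative; a real version would maintain n rows per player):

PRAGMA temp_store = MEMORY; -- keep temporary tables in RAM
CREATE TEMP TABLE recent_cache AS
SELECT player_id, result_id, event_date
FROM results
ORDER BY event_date DESC
LIMIT 600; -- e.g. n * number of players, tuned to your data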
Just a few more thoughts for you. Be creative, and remember that you can usually define what you want as a data structure as well as an algorithm. Good luck!

SQL Server 2000 : want to use single stored procedure to return different types of SQL queries

I'm limited to SQL Server 2000 on a very big project. For one module I have to create 3 to 10 stored procedures. To make this manageable, I'm writing one stored procedure that returns different SQL queries based on a condition, like:
IF @QueryId = 'SelAllEmp'
    SELECT EmpId, EmpName FROM EMP
ELSE IF @QueryId = 'SelEmpById'
    SELECT EmpId, EmpName FROM EMP WHERE EmpId = @EmpId
ELSE IF @QueryId = 'EMPDept'
    SELECT EmpId, DeptId, DeptName FROM EMPDept
......................................
My question is, are there any hidden consequences or impacts using this technique?
I don't think the way you are approaching this is manageable at all. For the cases you've shown in the question, you should strive to make that a single query. Let the client decide whether or not they'll use the DeptName column - the client has the option to ignore it, and knows to do so because it had to pass the EmpDept argument. If your client can ignore that column, then your three queries can become one:
SELECT EmpId, EmpName, DeptName
FROM dbo.EMP
WHERE EmpId = CASE
    WHEN @QueryId = 'SelEmpById' THEN @EmpId ELSE EmpId END;
This query solves all three of your conditions. To avoid getting stuck with a bad plan, you can add OPTION (RECOMPILE) to the statement or WITH RECOMPILE to the procedure. Yes, this can cause overhead (not as bad as Joon makes it sound), but I'll take a little compilation every time over getting sucked into a horrible plan every other day. By default, SQL Server 2000 caches a single plan for the stored procedure, and it can't optimize that one plan for all of your paths.
Another option is to build the query you need with dynamic SQL. This can cause plan cache bloat, but it shouldn't be too bad if all of the options are used frequently. (On newer versions of SQL Server you can mitigate the bloat with the optimize for ad hoc workloads server setting; it does not exist in SQL Server 2000.)
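A minimal sketch of the dynamic approach using sp_executesql (which exists in SQL Server 2000), reusing the @QueryId convention from the question; untested:

DECLARE @sql nvarchar(4000);
SET @sql = N'SELECT EmpId, EmpName FROM dbo.EMP';
IF @QueryId = 'SelEmpById'
    SET @sql = @sql + N' WHERE EmpId = @EmpId';
-- Parameterized execution keeps plans reusable and avoids injection.
EXEC sp_executesql @sql, N'@EmpId int', @EmpId;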
Two very valuable reads by Erland Sommarskog:
Dynamic Search Conditions in T-SQL (Version for SQL 2005 and Earlier)
The Curse and Blessings of Dynamic SQL
Basically, don't be afraid of dynamic SQL, but be aware of the potential issues.
Sorry, came back and edited since my answer was geared toward newer versions of SQL Server. It's hard to remember that people out there are still using SQL Server 2000 for some reason.
When a stored proc gets above a certain complexity, it will recompile whenever it is called from the client.
This places overhead on the server, and in busy apps can cause overall performance degradation if it happens enough.
That is one potential negative consequence of following this technique.
Also, your result set changes based on the input to your stored proc. That will potentially break clients that expect a certain field to be present or not.

LINQ to SQL Pagination and COUNT(*)

I'm using the PagedList class in my web application that many of you might be familiar with if you have been doing anything with ASP.NET MVC and LINQ to SQL. It has been blogged about by Rob Conery, and a similar incarnation was included in things like Nerd Dinner, etc. It works great, but my DBA has raised concerns about potential future performance problems.
His issue is around the SELECT COUNT(*) that gets issued as a result of this line:
TotalCount = source.Count();
Any action that has paged data will fire off an additional query (like below) as a result of the IQueryable.Count() method call:
SELECT COUNT(*) AS [value] FROM [dbo].[Products] AS [t0]
Is there a better way to handle this? I considered using the Count property of the PagedList class to get the item count, but realized that this won't work because it's only counting the number of items currently displayed (not the total count).
How much of a performance hit will this cause to my application when there's a lot of data in the database?
IIRC this is part of the index stats and should be very efficient; you should ask your DBA to substantiate his concerns rather than prematurely optimising.
Actually, this is a pretty common issue with LINQ.
Yes, index stats will get used if the statement is only SELECT COUNT(*) AS [value] FROM [dbo].[Products] AS [t0], but 99% of the time it's going to contain WHERE clauses as well.
So basically two SQL statements are executed:
SELECT COUNT(*) AS [value] FROM [dbo].[Products] AS [t0] WHERE blah=blah and someint=500
SELECT blah, someint FROM [dbo].[Products] AS [t0] WHERE blah=blah and someint=500
You start running into problems if the table is updated often, as the COUNT(*) returned by the first statement may not match the rows returned by the second; this can produce the error message 'Row not found or changed.'
Some databases (Oracle, PostgreSQL, and SQL Server, I think) keep a record of row counts in their system tables, though these are sometimes only accurate to the point at which the statistics were last refreshed (Oracle). You could use this approach if you only need a fairly-accurate-but-not-exact metric.
Which database are you using, or does that vary?
(PS I know that you are talking about MsSQL however)
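For example, in SQL Server 2000 an approximate row count can be read from sysindexes, using the Products table from the query above (indid 0 is the heap, 1 the clustered index; the value is only as fresh as the last statistics update):

SELECT rows
FROM sysindexes
WHERE id = OBJECT_ID('dbo.Products') AND indid < 2;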
I am no DBA, but COUNT(*) in MySQL is a real performance hit. Simply changing this to COUNT(ID) really does improve the speed.
I came across this when I was querying a table with very large BLOB (image) data. The query took around 15 seconds to load. Changing the query to COUNT(id) reduced it to 0.02 seconds. Still a little slow, but a hell of a lot better.
I think this is what the DBA is getting at. I have noticed that when debugging LINQ, the statement that counts takes a very long time (1 second) to move to the next statement.
Based on my findings, I have to agree with the DBA's concerns...
