Fetch latest rows grouped by unique field value - ruby-on-rails

I have a table of Books with an author_id field.
I'd like to fetch an array of Books containing only one Book per author: the one with the latest updated_at field.
The problem with a straightforward approach like Books.all.group('author_id') on Postgres is that it requires every selected field to appear in the GROUP BY clause. (See https://stackoverflow.com/a/6106195/1245302)
But I need to get whole Book objects, one per author, the most recent one, ignoring all other fields.
It seems to me that there's enough data for the DBMS to find exactly the rows I want;
at least I could do that myself without any other fields in the GROUP BY clause. :)
Is there any simple Rails 3 + Postgres (version < 9) or SQL-implementation-independent way to get that?
UPDATE
Nice solution for Postgres:
books.unscoped.select('DISTINCT ON(author_id) *').order('author_id').order('updated_at DESC')
BUT! One problem still remains: the results are sorted by author_id in the first place, but I need to sort by updated_at across the per-author results (to find, say, the top 10 most recently updated authors).
And Postgres doesn't allow you to change the order of the ORDER BY arguments in DISTINCT ON queries :(

I don't know Rails, but hopefully showing you the SQL for what you want will help get you to a way to generate the right SQL.
SELECT DISTINCT ON (author_id) *
FROM Books
ORDER BY author_id, updated_at DESC;
The DISTINCT ON (author_id) portion should not be confused with part of the result column list -- it just says that there will be one row per author_id. The list in a DISTINCT ON clause must be the leading portion of the ORDER BY clause in such a query, and the row which is kept is the one which sorts first based on the rest of the ORDER BY clause.
With a large number of rows this way of writing the query is usually much faster than any solution based on GROUP BY or window functions, often by an order of magnitude or more. It is a PostgreSQL extension, though; so it should not be used in code which is intended to be portable.
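If you do need something portable, one standard-SQL alternative is to join against a grouped subquery. This is only a sketch, and it assumes updated_at is unique per author; a tie on updated_at would produce more than one row for that author:
SELECT b.*
FROM Books b
JOIN (SELECT author_id, MAX(updated_at) AS max_updated
      FROM Books
      GROUP BY author_id) latest
  ON b.author_id = latest.author_id
 AND b.updated_at = latest.max_updated;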
If you want to use this result set inside another query (for example, to find the 10 most recently updated authors), there are two ways to do that. You can use a subquery, like this:
SELECT *
FROM (SELECT DISTINCT ON (author_id) *
FROM Books
ORDER BY author_id, updated_at DESC) w
ORDER BY updated_at DESC
LIMIT 10;
You could also use a CTE, like this:
WITH w AS (
SELECT DISTINCT ON (author_id) *
FROM Books
ORDER BY author_id, updated_at DESC)
SELECT * FROM w
ORDER BY updated_at DESC
LIMIT 10;
The usual advice about CTEs holds here: use them only where there isn't another way to write the query or if needed to coerce the planner by introducing an optimization barrier. The plans are very similar, but passing the intermediate results through the CTE scan adds a little overhead. On my small test set the CTE form is 17% slower.

This is belated, but in response to questions about overriding/resetting a default order, use .reorder(nil).order(:whatever_you_want_instead)
(I can't comment, so posting as an answer for now)

Related

Performance of generated T-SQL from Entity Framework

I recently used Entity Framework for a project, despite my DBA's strong disapproval. So one day he came to my office complaining about generated T-SQL that reaches his database.
For instance, when I want to select a product based on the id, I write something like this:
context.Products.FirstOrDefault(p=>p.Id==id);
Which translates to
SELECT ... FROM (SELECT TOP 1 ... FROM PRODUCTS WHERE ID=#id)
So he is shouting, "Why on earth would you write a SELECT * FROM (SELECT TOP 1)"
So I changed my code to
context.Products.Where(p=>p.Id==id).ToList().FirstOrDefault()
and this produces a much cleaner T-SQL:
SELECT ... FROM PRODUCTS WHERE ID=#id
The inner query and the TOP 1 disappeared. Enough rambling; my question is this: does the first query really add overhead for SQL Server? Is it harder to parse than the second method? The Id column has a clustered index on it. I want a good answer so I can rub it in his face (or mine).
Thanks,
Themos
Have you tried running the queries manually and comparing the execution plans?
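For example, in SSMS you could turn on I/O and timing statistics and run both shapes side by side. This is just a sketch: the Products table and Id column are taken from the question, and the parameter value is made up (DECLARE with an initializer needs SQL Server 2008 or later):
DECLARE @id INT = 42;  -- made-up id for illustration
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- shape 1: roughly what EF generated
SELECT * FROM (SELECT TOP (1) * FROM Products WHERE Id = @id) AS t;
-- shape 2: the "clean" version
SELECT TOP (1) * FROM Products WHERE Id = @id;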
The biggest problem here isn't that the SQL isn't perfectly formed to your DBA's standards (although I'm fairly certain the query engine will optimize out the extra select). The bigger issue is that the second version materializes every row matching the Where clause into a list and only then takes the first element in memory; trimming the result down to a single row is definitely a task that should be performed by the DB and not the application layer.
In short, he's being a pedant; leave it the way it was.

SQLite3 Database Query Optimization

I want a result combining 4 tables. Previously I was using 4 different queries; to improve performance, I started joining the tables and querying the single combined result. But there was no improvement in performance.
I later learnt that SQLite translates join statements into a WHERE clause, so I could use a WHERE clause directly instead of a join to save some CPU time.
But the problem with the WHERE-clause approach is that if one condition out of the four fails, the result set is empty. I want rows with the rest of the columns (those matching the other conditions) filled in, not an empty result, if one condition fails. Is there a way to achieve this? Thanks!
Have you considered using a LEFT OUTER JOIN?
for example
SELECT Customers.AcctNumber, Customers.Custname, catalogsales.InvoiceNo
FROM Customers
LEFT OUTER JOIN catalogsales ON Customers.Acctnumber = catalogsales.AcctNumber
In this example, if there aren't any matching rows in "catalogsales", it will still return the data from the "left" table, which in this case is "Customers".
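The same idea should extend to all four tables. Here is a sketch with hypothetical table and column names (items, shipments, and their keys are invented for illustration), chaining LEFT OUTER JOINs so a failed match in any one table yields NULLs instead of dropping the row:
SELECT c.AcctNumber, c.Custname, s.InvoiceNo, i.ItemName, sh.ShipDate
FROM Customers c
LEFT OUTER JOIN catalogsales s ON c.AcctNumber = s.AcctNumber
LEFT OUTER JOIN items i ON s.ItemNo = i.ItemNo
LEFT OUTER JOIN shipments sh ON s.InvoiceNo = sh.InvoiceNo;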
Without example SQL it's hard to know what you've tried.

Indices not working on sqlite table

I am using indices on columns on which I am making a search. The indices are created like this:
CREATE INDEX index1 on <TABLE>(<col1> COLLATE NOCASE ASC)
CREATE INDEX index2 on <TABLE>(<col2> COLLATE NOCASE ASC)
CREATE INDEX index3 on <TABLE>(<col3> COLLATE NOCASE ASC)
Now, the select query to search for records is like this:
select <col1> from <TABLE> where <col1> like '%monit%' AND <col2> like '%84%' GROUP BY <col1> limit 0,501;
When I run EXPLAIN QUERY PLAN on my sqlite database like this:
EXPLAIN QUERY PLAN select <col1> from <TABLE> where <col1> like '%monit%' AND <col2> like '%84%' GROUP BY <col1> limit 0,501;
It returns the output as:
0|0|0|SCAN TABLE USING INDEX (~250000 rows)
and when I drop the index, the output this EXPLAIN QUERY PLAN produces is:
0|0|0|SCAN TABLE (~250000 rows)
0|0|0|USE TEMP B-TREE FOR GROUP BY
Shouldn't the number of rows scanned (~250000) be smaller when an index is used to search the table?
I guess the problem here is with the LIKE keyword, because I have read somewhere that LIKE nullifies the use of indices... Here is the link.
EDIT: For indices to work on a query using LIKE, the right-hand side of the LIKE must be a string literal that does not begin with a wildcard character. So, in the above query, I tried the search parameters in LIKE without '%' at the beginning:
EXPLAIN QUERY PLAN select <col1> from <TABLE> where <col1> like 'monit%' AND <col2> like '84%' GROUP BY <col1> limit 0,501;
and the output I got was this:
0|0|0|SEARCH TABLE partnumber USING INDEX model_index_partnumber (model>? AND model
So, you see, the number of rows being searched (rather than scanned) is ~15625 in this case.
But the problem is that now I cannot do away with the % wildcard at the beginning. Can anyone suggest an alternative way to achieve the same?
EDIT:
I have tried using FTS3 from terminal but when I typed this query:
CREATE VIRTUAL TABLE <tbl> USING FTS3 (<col_list>);
Its throwing error as:
Error: no such module: FTS3
Can someone please help me enable FTS3 from the terminal as well as in Xcode? (I need the steps for both tasks.)
I am using SQLCipher and have already performed this from the terminal:
CFLAGS="-DSQLITE_ENABLE_FTS3=1" ./configure
EDIT:
Please see the question sqlite table taking time to fetch the records in LIKE query that I posted.
EDIT:
Hey all, I've had some success. I modified my select query to look like this:
select distinct description collate nocase as description from partnumber where rowid BETWEEN 1 AND (select max(rowid) from partnumber) AND description like '%a%' order by description;
And bingo, the search time was like never before. But the problem now is that when I run EXPLAIN QUERY PLAN, as below, it shows a temp B-tree being used for DISTINCT, which I don't want:
explain query plan select distinct description collate nocase as description from partnumber where rowid BETWEEN 1 AND (select max(rowid) from partnumber) AND description like '%a%' order by description;
Output:
0|0|0|SEARCH TABLE partnumber USING INTEGER PRIMARY KEY (rowid>? AND rowid<?) (~15625 rows)
0|0|0|EXECUTE SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE partnumber USING INTEGER PRIMARY KEY (~1 rows)
0|0|0|USE TEMP B-TREE FOR DISTINCT
A couple other options ...
Full Text Indexes:
http://sqlite.org/fts3.html
The most common (and effective) way to describe full-text searches is
"what Google, Yahoo and Altavista do with documents placed on the
World Wide Web".
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux'; /* 0.03 seconds */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
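Applied to the table in this question, a minimal FTS3 sketch might look like the following (the parts_fts name is invented; MATCH supports word-prefix queries like 'monit*', though not '%monit%'-style infix matches):
CREATE VIRTUAL TABLE parts_fts USING fts3(description);
INSERT INTO parts_fts(rowid, description)
  SELECT rowid, description FROM partnumber;
-- word-prefix match; much faster than LIKE '%monit%', but matches whole-word prefixes only
SELECT rowid FROM parts_fts WHERE description MATCH 'monit*';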
Word Breaking:
If you're looking for words (or words that start with a prefix), you can break the text blobs into words yourself and store your own indexed word tables. But even then, you'll only be able to do word LIKE 'monit%' searches to get hits like "monitor".
If possible, use the full-text route; it will be much less code. But if that's not an option for some reason, you can fall back to your own word-breaking tables, though that's limited to word-begins-with matches to avoid scans (still better than whole-text-block-begins-with).
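A minimal sketch of such a word table, assuming the partnumber table from the question (the word_index table and its columns are made up for illustration; it would be populated by splitting each description into words in application code):
CREATE TABLE word_index (
  word TEXT COLLATE NOCASE,
  part_rowid INTEGER
);
CREATE INDEX idx_word ON word_index(word COLLATE NOCASE);
-- prefix search can now use the index, then join back for the full rows
SELECT p.*
FROM word_index w
JOIN partnumber p ON p.rowid = w.part_rowid
WHERE w.word LIKE 'monit%';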
Be aware that the SQLite that ships with iOS does not have full-text search enabled. You can work around that; there are instructions on that and its use at:
http://longweekendmobile.com/2010/06/16/sqlite-full-text-search-for-iphone-ipadyour-own-sqlite-for-iphone-and-ipad/
The full docs on creating and querying full text tables are here: http://sqlite.org/fts3.html
To get FTS3 to also work from terminal, see:
Compiling the command line interface # http://www.sqlite.org/howtocompile.html
sqlite3 using fts3 create table in my mac terminal and how to use it in iphone xcode project?
This is quite simple. You are telling SQLite to examine every record in the table. It is faster to do this without using an index, because using an index would involve additional I/O. An index is used when you want to examine a subset of the records in a table, where the extra I/O of using the index is paid back by not having to examine every record.
When you say LIKE '%something', that means all records with anything at all at the beginning of the field, followed by "something". The only way to find these is to examine every single record. Note that indexes should still be used if you write LIKE 'something%', because in that case SQLite can use the index to find the subset of records beginning with "something". In the old days, when databases were not so clever, we used to write it like this to force the use of an index: SELECT * FROM tbl WHERE col1 >= 'something' AND col1 < 'somethinh'. Note the intentional misspelling of "something" in the second condition.
If you can it is best to avoid using % at the beginning of a LIKE condition. In some cases you may be able to change your schema so that data is stored in two columns rather than one. Then you use a LIKE "something%" search on the second of the two columns. Of course this depends on your data being structured right.
But even if splitting into two columns is not possible, it may be possible to divide and conquer the data in another way. For instance you could split the search fields into words, and index every word in a single column in another search table. That way "look for something or other" becomes a list of records where "something" is an exact match on a record in the search table. No LIKE required. You would then get a record ID to retrieve the original record. This is one of the things that SOLR does internally so if you must stick with SQLITE and cannot leverage SOLR or LUCENE in any way, then you can always read up on how they build inverted indices and do the same thing yourself in your SQLITE db.
Remember that LIKE '%something%' must examine every record, but if you can select a subset of the data first and then apply the LIKE search, it will run a lot faster. Filling the cache has the same effect, which is what your experiments with DISTINCT were doing. Maybe all you need to do is enlarge the cache to get acceptable search times. The first search will still be slow, but people are often quite forgiving of problems that go away when they retry.
When you use arbitrary wildcards like that you are getting very close to a full text search engine requirement like SOLR. These work by indexing the data 100% in RAM. With SQLITE you might be able to do something similar by creating a second in-memory database, reading all data from the disk tables into the in-memory db, then using the in-memory db for searching with wildcards. You would still have full-table scans with queries such as LIKE "%monit%" however that scan takes place in RAM where it is not as timeconsuming. You don't need to import all your data into RAM, only the parts where you need "%something%" searches, because SQLITE can do cross-database joins. SQLITE makes it easy to create an in-memory database, and the ATTACH DATABASE and DETACH DATABASE commands make it easy to connect a second database to your app. There is some example code for IOS in this question Can iPhone sqlite apps attach to other databases?
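A rough sketch of that in-memory approach (the part_search table name is invented; only the column that needs '%something%' searches is copied across):
ATTACH DATABASE ':memory:' AS mem;
CREATE TABLE mem.part_search AS
  SELECT rowid AS src_rowid, description FROM partnumber;
-- the full scan still happens, but in RAM; join back on src_rowid for the full rows
SELECT src_rowid FROM mem.part_search WHERE description LIKE '%monit%';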
Not sure why you don't like EXPLAIN showing a B-tree, since a B-tree is probably the fastest search structure available when your data has to be read from a filesystem.
I have a MySQL book that suggests REVERSE() the text (and, if your application permits, storing it in a column). Then search the reversed text using LIKE REVERSE('%something'), which puts the wildcard at the end, where an index can help.
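For what that looks like in practice, here is a MySQL sketch (SQLite has no built-in REVERSE(); the column and index names are invented, and VARCHAR is used because MySQL can't index a TEXT column without a prefix length). Note this trick only helps text-ends-with searches, not true infix ones:
ALTER TABLE partnumber ADD COLUMN description_rev VARCHAR(255);
UPDATE partnumber SET description_rev = REVERSE(description);
CREATE INDEX idx_desc_rev ON partnumber (description_rev);
-- finds descriptions ending in 'monitor'; REVERSE('%monitor') = 'rotinom%'
SELECT * FROM partnumber WHERE description_rev LIKE REVERSE('%monitor');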

How to randomize (and paginate) large set of results?

I am creating a contest application that requires the main index page of entries to be randomized. As it will potentially be a large set of entries (maybe up to 5000), I will also need to paginate them.
Here are the challenges:
I have read that using a database's 'random()' function on a large set can perform poorly.
I would like for things to not be re-randomized when the pagination links are clicked. In other words, it should return a random set upon first load and then keep the same order while someone uses the pagination.
The second challenge seems potentially unrealistic, but perhaps there are some creative solutions out there?
Thanks for any input.
A simple way I suggest is writing your own pseudo-random ordering into the SQL query; the more complicated the function, the more random the result. For example:
you already know
select * from your_table order by rand() limit 0, 10
Assume your_table has a primary key "id"; now replace "rand()" with "MOD(id, 13)":
select * from your_table order by MOD(id, 13) limit 0,10
if your_table has a datetime column, the result would be better, try this query:
select * from your_table order by MOD(id, 13), updated_at limit 0,10
Also, if you don't think that's random enough, here's one I bet you'll love:
select * from your_table order by MD5(id) limit 0, 10
I would just use a random number generator to select IDs, and store the seed in the session so a user will see the same ordering while paginating. I would probably also use a hash to make sure each ID is picked only once.
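One way to realize the seed-in-session idea in MySQL (a sketch; the entries table and page size are assumptions): ORDER BY RAND(N) with a constant seed is deterministic, so reusing the same seed on every page request reproduces the same ordering, and at ~5000 rows the cost of RAND() is tolerable.
-- first request: pick a seed, e.g. 42, and store it in the session
SELECT * FROM entries ORDER BY RAND(42) LIMIT 20 OFFSET 0;
-- page 2 reuses the seed from the session, so the ordering is unchanged
SELECT * FROM entries ORDER BY RAND(42) LIMIT 20 OFFSET 20;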

How to Sum calculated fields

I'd like to ask a question here that I think will be easy for some people.
OK, I have a query that returns records from two related tables (one-to-many).
In this query I have about 3 to 4 calculated fields that are based on fields from the 2 tables.
Now I want a GROUP BY clause for the names and a SUM to total the calculated fields, but it ends up with an error message saying:
"You tried to execute a query that is not part of an aggregate function."
So I decided to just run the query without the totals (i.e. no GROUP BY, SUM, etc.).
I then created another query that totals my previous query (i.e. using a GROUP BY clause for the names and SUM for the calculated fields; no calculation here). This works fine (it's what I used to do), but I don't like having two queries just to get a summary total. Is there any other way of doing this in design view, creating only one query?
I would very much appreciate it.
Thank you,
JM
Sounds like the query thinks the calculated fields need to be part of the grouping. You might need to look into sub-querying.
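One way that can work in a single Access query is to wrap the row-level calculation in a subquery and aggregate the result. This is only a sketch with placeholder table and field names; newer versions of Access accept a parenthesized subquery in FROM, while older ones may need it saved as a separate query anyway:
SELECT sub.CustName, SUM(sub.LineTotal) AS TotalAmount
FROM (
  SELECT c.CustName, d.Qty * d.Price AS LineTotal
  FROM Customers AS c INNER JOIN OrderDetails AS d
    ON c.CustID = d.CustID
) AS sub
GROUP BY sub.CustName;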
Can you post the SQL (before and after)? It would help in understanding what the issue is.
