How to randomize (and paginate) a large set of results? - ruby-on-rails

I am creating a contest application that requires the main index page of entries to be randomized. As it will potentially be a large set of entries (maybe up to 5000), I will also need to paginate them.
Here are the challenges:
I have read that using a database's 'random()' function on a large set can perform poorly.
I would like the order not to be re-randomized when the pagination links are clicked. In other words, it should return a random set upon first load and then keep the same order while someone uses the pagination.
The second challenge seems potentially unrealistic, but perhaps there are some creative solutions out there?
Thanks for any input.

A simple way I suggest is writing your own pseudo-random ordering function into the SQL query; the more complicated the function, the more random the result looks. For example, you already know:
select * from your_table order by rand() limit 0, 10
Assume your_table has a primary key "id"; now replace "rand()" with "MOD(id, 13)":
select * from your_table order by MOD(id, 13) limit 0,10
If your_table has a datetime column, the result will be better; try this query:
select * from your_table order by MOD(id, 13), updated_at limit 0,10
And if you still don't think it's random enough, here is one I bet you'll love:
select * from your_table order by MD5(id) limit 0, 10
Note that these expressions are deterministic, so the resulting order stays the same across page loads; the trade-off is that every visitor sees the same order unless you mix a per-session seed into the expression.

I would just use a random number generator to select IDs, and store the seed in the session so a user will see the same ordering while paginating. I would probably also use a hash to make sure each ID is picked only once.
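A minimal sketch of that idea in SQL (Postgres syntax; the entries table and the :seed placeholder are illustrative - the app would generate the seed once, store it in the session, and replay it on every page request):

-- session[:seed] is generated on first load and reused for later pages
-- MD5(id || seed) yields a stable pseudo-random order for a given seed
SELECT * FROM entries
ORDER BY MD5(id::text || :seed)
LIMIT 10 OFFSET 20;  -- page 3 at 10 per page

With the same seed, each request sees the same shuffle, and a new session gets a fresh ordering.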

Related

Multiple Selects With Influxdb

I am trying to figure out if it's possible to run multiple select statements on influxdb data. I've looked at continuous queries, but am not sure if it's what I need, or if it even makes sense to use them.
I would like to run:
select * from series group by work_id limit 1;
Then take that data, and run
select * from new_series_from_prior_query where state = 'error'
First question, is this even possible? Second, if not, is there another way to get the desired result using influxdb. Basically I need to filter all work items by their work_id and most recent state. Then, depending on what filters are passed in, check if they match and return that data.
Any help is greatly appreciated. If I cannot get it to work, I will most likely have to switch out the database, would love to stick with influxdb though.
InfluxDB just released 1.2 today, which has subqueries that solve this issue:
SELECT * FROM (SELECT * FROM workflows GROUP BY work_id LIMIT 1) WHERE state = 'processed'
This is what I was looking for.

Caching paginated data for scrolling interface and avoid client side duplicates

Basically here is the setup:
You have a number of marketplace items and you want to sort them by price. If the cache expires while someone is browsing, they will suddenly be presented with potential duplicate entries. This seems like a really terrible public API experience, and we are looking to avoid this problem.
Some basic philosophies I have seen include:
Reddit's, in which they track the last id seen by the client, though duplicates still have to be handled.
Will Paginate, which is a simple implementation that basically returns results based on the number of items you want per page and an offset.
Then there are many varied solutions that involve Redis sorted sets, etc., but these also don't really solve the problem of how to remove the duplicate entries.
Does anyone have a fairly reliable way to deal with paginating sorted, dynamic lists without duplicates?
If the items you need to paginate are sorted properly (on unique values), then the only thing you need to do is to select the results by that value instead of by offset.
Simple SQL example:
SELECT * FROM items ORDER BY id DESC LIMIT 10; /* page 1 */
Let's say row #10 has id = 42 (and id is the primary key):
SELECT * FROM items WHERE id < 42 ORDER BY id DESC LIMIT 10; /* page 2 */
If you are using PostgreSQL (MySQL probably has the same problem) this also solves the performance problem with OFFSET (OFFSET N LIMIT M has to scan and discard N rows!).
If sorting is not unique (e.g. sorting on a creation timestamp can lead to multiple items created at the same time) you are going to have the duplication problem; see the sketch below for the usual fix.
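A rough sketch of that fix, assuming an items table with primary key id, sorted by price: make the ordering unique by adding the primary key as a tie-breaker, then seek on the composite value (PostgreSQL, SQLite 3.15+, and MySQL all accept this row-value syntax):

SELECT * FROM items ORDER BY price, id LIMIT 10; /* page 1 */
-- suppose the last row of page 1 had price = 9.99 and id = 42
SELECT * FROM items
WHERE (price, id) > (9.99, 42)
ORDER BY price, id
LIMIT 10; /* page 2 */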

Fetch latest rows grouped by unique field value

I have a table of Books with an author_id field.
I'd like to fetch an array of Books which contain only one Book of every author. The one with the latest updated_at field.
The problem with a straightforward approach like Books.all.group('author_id') on Postgres is that it requires every selected field to appear in the GROUP BY clause. (See https://stackoverflow.com/a/6106195/1245302)
But I need to get whole Book objects, one per author, the most recent one, ignoring all the other fields.
It seems to me that there's enough data for the DBMS to find exactly the rows I want; at least I could do that myself without any other fields in the GROUP BY clause. :)
Is there any simple Rails 3 + Postgres (version < 9) or SQL implementation-independent way to get that?
UPDATE
Nice solution for Postgres:
books.unscoped.select('DISTINCT ON(author_id) *').order('author_id').order('updated_at DESC')
BUT! One problem still remains: the results are sorted by author_id in the first place, but I need to sort by updated_at across the deduplicated rows (to find, say, the top 10 authors with the most recent books).
And Postgres doesn't allow you to change the order of the ORDER BY arguments in DISTINCT ON queries :(
I don't know Rails, but hopefully showing you the SQL for what you want will help get you to a way to generate the right SQL.
SELECT DISTINCT ON (author_id) *
FROM Books
ORDER BY author_id, updated_at DESC;
The DISTINCT ON (author_id) portion should not be confused with part of the result column list -- it just says that there will be one row per author_id. The list in a DISTINCT ON clause must be the leading portion of the ORDER BY clause in such a query, and the row which is kept is the one which sorts first based on the rest of the ORDER BY clause.
With a large number of rows this way of writing the query is usually much faster than any solution based on GROUP BY or window functions, often by an order of magnitude or more. It is a PostgreSQL extension, though; so it should not be used in code which is intended to be portable.
If you want to use this result set inside another query (for example, to find the 10 most recently updated authors), there are two ways to do that. You can use a subquery, like this:
SELECT *
  FROM (SELECT DISTINCT ON (author_id) *
          FROM Books
          ORDER BY author_id, updated_at DESC) w
  ORDER BY updated_at DESC
  LIMIT 10;
You could also use a CTE, like this:
WITH w AS
  (SELECT DISTINCT ON (author_id) *
     FROM Books
     ORDER BY author_id, updated_at DESC)
SELECT * FROM w
  ORDER BY updated_at DESC
  LIMIT 10;
The usual advice about CTEs holds here: use them only where there isn't another way to write the query or if needed to coerce the planner by introducing an optimization barrier. The plans are very similar, but passing the intermediate results through the CTE scan adds a little overhead. On my small test set the CTE form is 17% slower.
This is belated, but in response to questions about overriding/resetting a default order, use .reorder(nil).order(:whatever_you_want_instead)
(I can't comment, so posting as an answer for now)

Indices not working on sqlite table

I am using indices on columns on which I am making a search. The indices are created like this:
CREATE INDEX index1 on <TABLE>(<col1> COLLATE NOCASE ASC)
CREATE INDEX index2 on <TABLE>(<col2> COLLATE NOCASE ASC)
CREATE INDEX index3 on <TABLE>(<col3> COLLATE NOCASE ASC)
Now, the select query to search for records is like this:
select <col1> from <TABLE> where <col1> like '%monit%' AND <col2> like '%84%' GROUP BY <col1> limit 0,501;
When I run EXPLAIN QUERY PLAN on my sqlite database like this:
EXPLAIN QUERY PLAN select <col1> from <TABLE> where <col1> like '%monit%' AND <col2> like '%84%' GROUP BY <col1> limit 0,501;
It returns the output as:
0|0|0|SCAN TABLE USING INDEX (~250000 rows)
and when I drop the index, the output this EXPLAIN QUERY PLAN produces is:
0|0|0|SCAN TABLE (~250000 rows)
0|0|0|USE TEMP B-TREE FOR GROUP BY
Isn't the number of rows scanned (~250000) supposed to be smaller when the index is used to search the table?
I guess the problem here is with the LIKE keyword, because I have read somewhere that the LIKE keyword nullifies the use of indices... Here is the link
EDIT: For indices to work on a query which is using LIKE, the right-hand side of the LIKE must be a string literal that does not begin with a wildcard character. So, in the above query, I tried using the search parameter in LIKE without the '%' at the beginning:
EXPLAIN QUERY PLAN select <col1> from <TABLE> where <col1> like 'monit%' AND <col2> like '84%' GROUP BY <col1> limit 0,501;
and the output I got was this:
0|0|0|SEARCH TABLE partnumber USING INDEX model_index_partnumber (model>? AND model<?) (~15625 rows)
So, you see, the number of rows being searched (rather than scanned) is ~15625 in this case.
But the problem now is that I cannot do away with the % wildcard at the beginning. Can anyone please suggest an alternative way to achieve the same?
EDIT:
I have tried using FTS3 from terminal but when I typed this query:
CREATE VIRTUAL TABLE <tbl> USING FTS3 (<col_list>);
It's throwing this error:
Error: no such module: FTS3
Can someone please help me enable FTS3 from the terminal as well as Xcode (I need the steps I must perform for both tasks)?
I am using sqlcipher and have already performed this from the terminal:
CFLAGS="-DSQLITE_ENABLE_FTS3=1" ./configure
EDIT:
Please see the question sqlite table taking time to fetch the records in LIKE query that I posted.
EDIT:
Hey all, I got some success. I modified my select query to look like this:
select distinct description collate nocase as description from partnumber where rowid BETWEEN 1 AND (select max(rowid) from partnumber) AND description like '%a%' order by description;
And bingo, the search time was like never before. But the problem now is that when I execute EXPLAIN QUERY PLAN like this, it shows a temp B-tree being used for DISTINCT, which I don't want:
explain query plan select distinct description collate nocase as description from partnumber where rowid BETWEEN 1 AND (select max(rowid) from partnumber) AND description like '%a%' order by description;
Output:
0|0|0|SEARCH TABLE partnumber USING INTEGER PRIMARY KEY (rowid>? AND rowid<?) (~15625 rows)
0|0|0|EXECUTE SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE partnumber USING INTEGER PRIMARY KEY (~1 rows)
0|0|0|USE TEMP B-TREE FOR DISTINCT
A couple other options ...
Full Text Indexes:
http://sqlite.org/fts3.html
The most common (and effective) way to describe full-text searches is
"what Google, Yahoo and Altavista do with documents placed on the
World Wide Web".
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux'; /* 0.03 seconds */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
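Applied to the table from the question, a minimal sketch might look like this (partnumber_fts is a made-up name; note that FTS3 matches token prefixes, so monit* finds "monitor", but a middle-of-word match like '%monit%' is still not possible):

CREATE VIRTUAL TABLE partnumber_fts USING fts3(description);

-- copy the searchable text across, keeping the original rowids
INSERT INTO partnumber_fts(rowid, description)
SELECT rowid, description FROM partnumber;

-- prefix query against the full-text index
SELECT rowid FROM partnumber_fts WHERE description MATCH 'monit*';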
Word Breaking:
If you're looking for words (or words that start with), you can break text blobs into words yourself and store your own indexed word tables. But even then, you'll only be able to do word LIKE 'monit%' to get hits like "monitor".
If possible, use the full text search; it will be much less code. But if that's not an option for some reason, you can fall back to your own word-breaking tables, though that limits you to word-begins-with matches to avoid scans (still better than whole-text-block-begins-with).
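For example, the word table might look something like this (all names are made up; the NOCASE collation mirrors the question's indexes so SQLite's LIKE prefix optimization can apply):

CREATE TABLE words (
  word   TEXT COLLATE NOCASE,
  row_id INTEGER              -- rowid of the source record
);
CREATE INDEX idx_words_word ON words(word COLLATE NOCASE);

-- "starts with monit" can now use the index
SELECT DISTINCT row_id FROM words WHERE word LIKE 'monit%';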
Be aware that the sqlite that comes with iOS does not have full text search enabled. You can work around that; there are instructions on that and its use at:
http://longweekendmobile.com/2010/06/16/sqlite-full-text-search-for-iphone-ipadyour-own-sqlite-for-iphone-and-ipad/
The full docs on creating and querying full text tables are here: http://sqlite.org/fts3.html
To get FTS3 to also work from the terminal, see:
Compiling the command line interface: http://www.sqlite.org/howtocompile.html
sqlite3 using fts3 create table in my mac terminal and how to use it in iphone xcode project?
This is quite simple. You are telling SQLite to examine every record in the table. It is faster to do this without using an index, because using an index would involve additional IO. An index is used when you want to examine a subset of the records in a table, where the extra IO of using the index is paid back by not having to examine every record.
When you say LIKE '%something', that means all records with anything at all at the beginning of the field, followed by 'something'. The only way to do this is to examine every single record. Note that indexes should still be used if you only use LIKE 'something%', because in this case SQLite can use the index to find the subset of records beginning with 'something'. In the old days, when databases were not so clever, we used to write it like this to force the use of an index: SELECT * FROM <TABLE> WHERE col1 >= 'something' AND col1 < 'somethinh'. Note the intentional misspelling of 'something' in the second condition.
If you can, it is best to avoid using % at the beginning of a LIKE condition. In some cases you may be able to change your schema so that data is stored in two columns rather than one. Then you use a LIKE 'something%' search on the second of the two columns. Of course this depends on your data being structured right.
But even if splitting into two columns is not possible, it may be possible to divide and conquer the data in another way. For instance, you could split the search fields into words and index every word in a single column in another search table. That way "look for something or other" becomes a list of records where "something" is an exact match on a record in the search table. No LIKE required. You would then get a record ID to retrieve the original record. This is one of the things that SOLR does internally, so if you must stick with SQLite and cannot leverage SOLR or Lucene in any way, you can always read up on how they build inverted indices and do the same thing yourself in your SQLite db.
Remember that LIKE '%something%' must examine every record, but if you can select a subset of the data first and then apply the LIKE search, this will run a lot faster. Filling the cache will have the same effect, which is what your experiments with DISTINCT were doing. Maybe all you need to do is to enlarge the cache to get acceptable search times. The first search will still be slow, but people are often quite forgiving of problems which go away when you retry it.
When you use arbitrary wildcards like that, you are getting very close to a full-text search engine requirement like SOLR. These work by indexing the data 100% in RAM. With SQLite you might be able to do something similar by creating a second in-memory database, reading all data from the disk tables into the in-memory db, then using the in-memory db for searching with wildcards. You would still have full-table scans with queries such as LIKE '%monit%'; however, that scan takes place in RAM, where it is not as time-consuming. You don't need to import all your data into RAM, only the parts where you need '%something%' searches, because SQLite can do cross-database joins. SQLite makes it easy to create an in-memory database, and the ATTACH DATABASE and DETACH DATABASE commands make it easy to connect a second database to your app. There is some example code for iOS in this question: Can iPhone sqlite apps attach to other databases?
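A sketch of that approach (the mem alias and the column choice are illustrative):

ATTACH DATABASE ':memory:' AS mem;

-- copy only the columns you need to wildcard-search
CREATE TABLE mem.partnumber AS
SELECT rowid AS id, description FROM main.partnumber;

-- still a full scan, but now it happens in RAM
SELECT id FROM mem.partnumber WHERE description LIKE '%monit%';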
Not sure why you don't like EXPLAIN using B-Trees since the b-tree is probably the fastest possible search structure available when your data has to be read from a filesystem.
I have a MySQL book that suggests REVERSE()ing the text (and, if your application permits, storing it in a column), then searching the reversed text using LIKE REVERSE('%something').
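Sketched below in MySQL syntax (stock SQLite has no REVERSE() function, so there you would reverse the string in application code or register a custom function). Reversing turns the leading wildcard into a trailing one, which an index can serve:

ALTER TABLE partnumber ADD COLUMN description_rev TEXT;
UPDATE partnumber SET description_rev = REVERSE(description);
CREATE INDEX idx_desc_rev ON partnumber(description_rev);

-- REVERSE('%monitor') = 'rotinom%', an index-friendly prefix match
SELECT * FROM partnumber WHERE description_rev LIKE REVERSE('%monitor');

Note this only helps "ends with" searches like '%monitor'; it does not help a double-ended pattern like '%monit%'.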

Rails - insert new data, or increment existing value with update

In my rails app, I have a "terms" model, that stores a term (a keyword), and the frequency with which it appears in a particular document set (an integer). Whenever a new document gets added to the set, I parse out the words, and then I need to either insert new terms, and their frequency, into the terms table, or I need to update the frequency of an existing term.
The easiest way to do this would be to do a find, then if it's empty do an insert, or if it's not empty, increment the frequency of the existing record by the correct amount. That's two queries per word, however, and documents with high word counts will result in a ludicrously long list of queries. Is there a more efficient way to do this?
You can do this really efficiently, actually. Well, if you're not afraid to tweak Rails's default table layout a bit, and if you're not afraid to generate your own raw SQL...
I'm going to assume you're using MySQL for your database (I'm not sure what other DBs support this): you can use INSERT ... ON DUPLICATE KEY UPDATE to do this.
You'll have to tweak your count table to get it to work, though - "on duplicate key" only refers to the primary key, and Rails's default ID, which is just an arbitrary number, won't help you. You'll need to change your primary key so that it identifies what makes each record unique - in your case, I'd say PRIMARY KEY(word, document_set_id). This might not be supported by Rails by default, but there's at least one plugin, and probably a couple more, if you don't like that one.
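For example, the table might be declared like this (MySQL syntax; the names, including the occurences spelling, match the insert statement further down and are otherwise illustrative):

CREATE TABLE word_counts (
  word            VARCHAR(255) NOT NULL,
  document_set_id INT          NOT NULL,
  occurences      INT          NOT NULL,
  PRIMARY KEY (word, document_set_id)
);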
Once your database is set up, you can build one giant insert statement and throw that at MySQL, letting the "on duplicate key" part of the query take care of the nasty existence-checking stuff for you (NOTE: there are plugins to do batch inserts, too, but I don't know how they work, particularly in regards to "on duplicate key"):
counts = {}
# This is just demo code! Untested, and it'll leave in punctuation...
document.text.split(' ').each do |word|
  counts[word] ||= 0
  counts[word] += 1
end

values = []
counts.each_pair do |word, count|
  values << ActiveRecord::Base.send(:sanitize_sql_array, [
    '(?, ?, ?)',
    word,
    document.set_id,
    count
  ])
end

# Massive line - sorry...
ActiveRecord::Base.connection.execute("INSERT INTO word_counts (word, document_set_id, occurences) VALUES #{values.join(', ')} ON DUPLICATE KEY UPDATE occurences = occurences + VALUES(occurences)")
And that'll do it - one SQL query for the entire new document. Should be much faster, half because you're only running a single query, and half because you've sidestepped ActiveRecord's sluggish query building.
Hope that helps!
