Multi-column index vs separate indexes vs partial indexes - ruby-on-rails

While working on my Rails app today I noticed that the paranoia gem says that indexes should be updated to add deleted_at IS NOT NULL as a WHERE clause on index creation (github link). But it occurred to me that the inverted condition, for the times I do want with_deleted, won't benefit from the index.
This makes me wonder...
I know that this is somewhat open-ended, because the answer is obviously "it depends on what you need," but I am trying to get an idea of the differences between a multi-column index, separate indexes, and partial indexes in my web app backed by PostgreSQL.
Basically, I have two fields that I am querying on: p_id and deleted_at. Most of the time I am querying WHERE p_id=1 AND deleted_at IS NOT NULL, but sometimes I only query WHERE p_id=1. Very seldom, I will query WHERE p_id=1 AND deleted_at=1/1/2017.
So, am I better off:
Having an index on p_id and a separate index on deleted_at?
Having an index on p_id but add 'where deleted_at IS NOT NULL'?
Having a combined index on p_id and deleted_at together?
Note: perhaps I should mention that p_id is currently a foreign key reference to p.id. Which reminds me: in Postgres, is it necessary for foreign keys to also have indexes, or do they get an index derived from being a foreign key constraint? (I've read conflicting answers on this.)

The answer depends on:
how often you use each of these queries, and how long they are allowed to run;
whether query speed is important enough that slower data changes can be tolerated.
The perfect indexes for the three clauses are:
1. WHERE p_id=1 AND deleted_at IS NOT NULL
CREATE INDEX ON mytable (p_id) WHERE deleted_at IS NOT NULL;
2. WHERE p_id=1 AND deleted_at=1/1/2017
CREATE INDEX ON mytable (p_id, deleted_at);
3. WHERE p_id=1
CREATE INDEX ON mytable (p_id);
The index created for 2. can also be used for 3., so if you need to speed up the second query as much as possible and a slightly bigger index doesn't bother you, create only the index from 2. for both queries.
However, the index from 3. will also speed up the query in 2., just not as much as possible, so if you can live with a slightly worse performance for the query in 2. and want the index as small and efficient as possible for the query in 3., create only the index in 3.
I would not create both the indexes from 2. and 3.; you should pick whichever is best for you.
The case with 1. is different, because that index can only be used for the first query. Create that index only if you want to speed up that query as much as possible, and it doesn't matter if data modifications on the table take longer, because an additional index has to be maintained.
Another indication to create the index in 1. is if only a small percentage of rows satisfies deleted_at IS NOT NULL. If not, the index in 1. doesn't have a great advantage over the one in 3., and you should just create the latter.
Having two separate indexes on the two columns is probably not the best choice – they can be used in combination only with a bitmap index scan, and it may well be that PostgreSQL chooses to use only one of the indexes (depending on the distribution, probably the one on p_id), leaving the other one useless.
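To check which of these indexes the planner actually picks, EXPLAIN is the tool to reach for. A minimal sketch, reusing the answer's mytable name (the referenced table p is assumed to exist):

CREATE TABLE mytable (
    id         serial PRIMARY KEY,
    p_id       integer NOT NULL REFERENCES p (id),
    deleted_at timestamp
);
-- Note for the foreign-key question above: PostgreSQL does not index
-- p_id automatically; a foreign key constraint only requires an index
-- on the referenced side (p.id), so index the referencing column yourself.
CREATE INDEX ON mytable (p_id, deleted_at);  -- the index from 2.

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM mytable WHERE p_id = 1 AND deleted_at IS NOT NULL;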

Related

Effects of Rails' default_scope on performance

Can a default_scope that does not order records by ID significantly slow down a Rails application?
For example, I have a Rails (currently 3.1) app using PostgreSQL where nearly every Model has a default_scope ordering records by their name:
default_scope order('users.name')
Right now, because the default_scopes order records by name rather than by ID, I am worried I might be incurring a significant performance penalty when normal queries are run. For example with:
User.find(5563401)
or
User.where('created_at = ?', 2.weeks.ago)
or
User.some_scope_sorted_best_by_id.all
In the above examples, what performance penalty might I incur by having a default_scope by name on my Model? Should I be concerned about this default_scope affecting application performance?
Your question is missing the point. The default scope itself is just a few microseconds of Ruby execution to cause an order by clause to be added to every SQL statement sent to PostgreSQL.
So your question is really asking about the performance difference between unordered queries and ordered ones.
The PostgreSQL documentation is pretty explicit: ordered queries on unindexed fields are much slower than unordered ones, because (no surprise) PostgreSQL must sort the results before returning them, first creating a temporary table or index to hold the result. This could easily be a factor of 4 in query time, possibly much more.
If you introduce an index just to achieve quick ordering, you are still paying to maintain the index on every insert and update. And unless it's the primary index, sorted access still involves random seeks, which may actually be slower than creating a temporary table. This is also discussed in the Postgres docs.
In a nutshell, NEVER add an ORDER BY clause to an SQL query that doesn't need it (unless you enjoy waiting for your database).
NB: I doubt a simple find() will have order by attached because it must return exactly one result. You can verify this very quickly by starting rails console, issuing a find, and watching the generated SQL scroll by. However, the where and all definitely will be ordered and consequently definitely be slower than needed.
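To make the cost concrete, here is a sketch of the two query shapes involved, assuming a users table with no index on name (the date literal is a placeholder):

-- with the default_scope, every query carries an ORDER BY:
SELECT * FROM users WHERE created_at = '2011-09-01' ORDER BY users.name;
-- without it, rows can be returned as soon as they are found:
SELECT * FROM users WHERE created_at = '2011-09-01';
-- EXPLAIN ANALYZE on both will show the extra Sort node and its cost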

Neo4j auto-increment schema index

It is recommended not to use Neo4j's id property because it may change, so instead we create our own identifier. To identify my users, I plan to create a user_id property on the nodes labelled User and put an index on it. However, I cannot figure out a way to make it auto-increment.
After some searching, I noticed there are two kinds of indexes in Neo4j: the schema index and the legacy index. Could anyone explain the difference between them? And is there a way to make my user_id index auto-increment?
Schema indexes are tied to labels, e.g. :User; you create them on the properties of those labels. There's also no need to specify which index you're using in a query, as that is chosen automatically in this case.
Legacy indexes are the node indexes that were around prior to Neo4j 2.0. They're a traditional index where you specify what you're indexing and which properties it applies to, but they're only used in START statements, which are optional (and on their way to deprecation).
For more detail, have a look here (http://docs.neo4j.org/chunked/stable/graphdb-neo4j-schema.html) and here (http://docs.neo4j.org/chunked/stable/indexing.html).
As for auto-incrementing, I'm unaware of any such functionality for user-defined index keys.
HTH

Optimize Searching Through Rails Database

I'm building a Rails project, and I have a database with a set of tables, each holding between 500k and 1M rows, and I am constantly creating new rows.
By the nature of the project, before each creation I have to search the table for duplicates (on one field), so I don't create the same row twice. Unfortunately, as my table grows, this is taking longer and longer.
I was thinking that I could optimize the search by adding indexes to the specific string fields I am searching on, but I have heard that adding indexes increases creation time.
So my question is as follows:
What is the trade-off between finding and creating rows whose fields are indexed? I know adding indexes to the fields will make Model.find_by_name faster, but how much slower will it make my row creation?
Indexing slows down insertion of entries because the entry must also be added to the index, and that needs some resources; but once added, indexes speed up your select queries, just as you said. BUT maybe the B-tree isn't the right choice for you, because a B-tree indexes the first X units of the indexed value. That's great when you have integers, but text search is tricky. When you do queries like
Model.where("name LIKE ?", "#{params[:name]}%")
it will speed up selection but when you use queries like this:
Model.where("name LIKE ?", "%#{params[:name]}%")
it won't help you, because the whole string has to be searched, and it can be longer than a few hundred characters; then it's no improvement to have the first 8 units of a 250-char string indexed! So that's one thing. But there's another...
You should add a UNIQUE INDEX, because the database is better at finding duplicates than Ruby is! It's optimized for sorting, and it's definitely the shorter and cleaner way to deal with this problem. Of course you should also add a validation to the relevant model, but that's no reason to let things slide in the database.
More about index speed:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
You don't have a large set of options. I don't think the insert speed loss will be that great when you only need one index, but the select speed will increase proportionally!
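As a sketch of the unique-index approach (table and column names here are placeholders; adjust them to your schema):

CREATE TABLE models (
    id   SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);
-- reject duplicates at the database level
CREATE UNIQUE INDEX index_models_on_name ON models (name);
INSERT INTO models (name) VALUES ('example');
INSERT INTO models (name) VALUES ('example');  -- fails: duplicate key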

Rails - insert new data, or increment existing value with update

In my rails app, I have a "terms" model, that stores a term (a keyword), and the frequency with which it appears in a particular document set (an integer). Whenever a new document gets added to the set, I parse out the words, and then I need to either insert new terms, and their frequency, into the terms table, or I need to update the frequency of an existing term.
The easiest way to do this would be to do a find, then if it's empty do an insert, or if it's not empty, increment the frequency of the existing record by the correct amount. That's two queries per word, however, and documents with high word counts will result in a ludicrously long list of queries. Is there a more efficient way to do this?
You can do this really efficiently, actually. Well, if you're not afraid to tweak Rails's default table layout a bit, and if you're not afraid to generate your own raw SQL...
I'm going to assume you're using MySQL for your database (I'm not sure what other DBs support this): you can use INSERT ... ON DUPLICATE KEY UPDATE to do this.
You'll have to tweak your count table to get it to work, though: "on duplicate key" refers to the primary key (or, in fact, any unique index), and Rails's default ID, which is just an arbitrary number, won't help you. You'll need to change your primary key so that it identifies what makes each record unique; in your case, I'd say PRIMARY KEY(word, document_set_id). This might not be supported by Rails by default, but there's at least one plugin, and probably a couple more if you don't like that one.
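For illustration, a sketch of that tweaked table (assuming MySQL; the names match the query further down, and the original occurences spelling is kept):

CREATE TABLE word_counts (
    word            VARCHAR(255) NOT NULL,
    document_set_id INT NOT NULL,
    occurences      INT NOT NULL DEFAULT 0,
    -- composite key: ON DUPLICATE KEY UPDATE fires whenever this
    -- (word, document_set_id) pair already exists
    PRIMARY KEY (word, document_set_id)
) ENGINE=InnoDB;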
Once your database is set up, you can build one giant insert statement, and throw that at MySQL, letting the "on duplicate key" part of the query take care of the nasty existence-checking stuff for you (NOTE: there are plugins to do batch inserts, too, but I don't know how they work, particularly in regard to "on duplicate key"):
counts = {}

# This is just demo code! Untested, and it'll leave in punctuation...
document.text.split(' ').each do |word|
  counts[word] ||= 0
  counts[word] += 1
end

values = []
counts.each_pair do |word, count|
  values << ActiveRecord::Base.send(:sanitize_sql_array, [
    '(?, ?, ?)',
    word,
    document.set_id,
    count
  ])
end

# Massive line - sorry...
ActiveRecord::Base.connection.execute("INSERT INTO word_counts (word, document_set_id, occurences) VALUES #{values.join(', ')} ON DUPLICATE KEY UPDATE occurences = occurences + VALUES(occurences)")
And that'll do it - one SQL query for the entire new document. Should be much faster, half because you're only running a single query, and half because you've sidestepped ActiveRecord's sluggish query building.
Hope that helps!

Can one rely on the auto-incrementing primary key in your database?

In my present Rails application, I am resolving scheduling conflicts by sorting the models by the "created_at" field. However, I realized that when inserting multiple models from a form that allows this, all of the created_at times are exactly the same!
This is more a question of best programming practices: can your application rely on the ID column in your database incrementing with each INSERT, so that it reflects order of creation? To put it another way, can I sort a group of rows pulled from my database by their ID column and be assured the sort accurately reflects creation order? And is this a good practice in my application?
The generated identification numbers will be unique,
regardless of whether you use sequences, as in PostgreSQL and Oracle, or another mechanism such as MySQL's auto-increment.
However, sequence values are often acquired in blocks of, for example, 20 numbers.
So with PostgreSQL you cannot determine which row was inserted first, and there may even be gaps in the IDs of inserted records.
Therefore you shouldn't use a generated ID field for a task like that; don't rely on database implementation details.
Generating a created or updated field when the command executes is much better for sorting by creation or update time later on.
For example:
INSERT INTO A (data, created) VALUES ('something', NOW());
UPDATE A SET data='something', updated=NOW();
That depends on your database vendor.
MySQL, I believe, absolutely orders auto-increment keys. I don't know for sure whether SQL Server does, but I believe it does.
Where you'll run into problems is with databases that don't support this, most notably Oracle, which uses sequences that are roughly but not absolutely ordered.
An alternative might be to go for created time and then ID.
I believe the answer to your question is yes. If I read between the lines, I think you are concerned that the system may re-use ID numbers that are 'missing' from the sequence. If you had used 1, 2, 3, 5, 6, 7 as ID numbers, then in all the implementations I know of, the next ID number will always be 8 (or possibly higher); I don't know of any DB that would work out that record ID #4 is missing and try to re-use it.
Though I am most familiar with SQL Server, I don't know why any vendor would try to fill the gaps in a sequence: think of the overhead of keeping a list of unused IDs, as opposed to just tracking the last ID used and adding 1.
I'd say you can safely rely on the next assigned ID always being higher than the last, not just unique.
Yes, the ID will be unique, but no, you cannot and should not rely on it for sorting; it is there to guarantee row uniqueness only. The best approach is, as emktas indicated, to use a separate "updated" or "created" field for just this information.
For setting the creation time, you can just use a default value like this
CREATE TABLE foo (
    id INTEGER UNSIGNED AUTO_INCREMENT NOT NULL,
    created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated TIMESTAMP NULL,
    PRIMARY KEY (id)
) engine=InnoDB; ## whatever :P
Now, that takes care of creation time. For the update time I would suggest a BEFORE UPDATE trigger like the one below (in MySQL, NEW values can only be modified in a BEFORE trigger, not an AFTER one). Of course you can do it in a separate query, but the trigger, in my opinion, is a better solution - more transparent:
DELIMITER $$
CREATE TRIGGER foo_b_upd BEFORE UPDATE ON foo
FOR EACH ROW BEGIN
    SET NEW.updated = NOW();
END;
$$
DELIMITER ;
And that should do it.
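Usage is then transparent. A quick sanity check (the data column here is hypothetical; substitute whatever payload columns foo actually has):

-- 'data' is a hypothetical payload column, not part of the table above
UPDATE foo SET data = 'something else' WHERE id = 1;
SELECT id, created, updated FROM foo WHERE id = 1;
-- updated was filled in by the trigger, without the application touching it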
EDIT:
Woe is me. Foolishly I did not specify that this is for MySQL; in other databases there might be differences in function names (namely NOW) and other subtle itty-bitty details.
One caveat to EJB's answer:
SQL does not guarantee any ordering if you don't specify an ORDER BY column. E.g., if you delete some early rows and then insert new ones, the new rows may end up stored in the same place in the DB as the old ones (albeit with new IDs), and that's what the database may use as its default ordering.
FWIW, I typically use order by ID as an effective version of order by created_at. It's cheaper in that it doesn't require adding an index to a datetime field (which is bigger and therefore slower than a simple integer primary key index), guaranteed to be different, and I don't really care if a few rows that were added at about the same time sort in some slightly different order.
This is probably DB-engine dependent. I would check how your DB implements sequences, and if there are no documented problems then I would decide to rely on the ID.
E.g. a PostgreSQL sequence is OK unless you play with the sequence cache parameters.
There is a possibility that another programmer will manually create or copy records from a different DB with a wrong ID column. However, I would simplify the problem: do not bother with low-probability cases where someone manually destroys data integrity. You cannot protect against everything.
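For example, a sketch of how a cached sequence can leave gaps (the sequence name is hypothetical):

-- each session pre-allocates 20 values at a time
CREATE SEQUENCE order_id_seq CACHE 20;
SELECT nextval('order_id_seq');
-- if a session disconnects without using its cached values,
-- those numbers are simply skipped, leaving gaps in the IDs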
My advice is to rely on sequence generated IDs and move your project forward.
In theory, yes, the highest ID number is the last created. Remember though that databases have the ability to temporarily turn off inserting the auto-generated value, insert some records manually, and then turn it back on. These inserts are not typically used on a production system, but they can happen occasionally when moving a large chunk of data from another system.
