Rails - insert new data, or increment existing value with update - ruby-on-rails

In my rails app, I have a "terms" model, that stores a term (a keyword), and the frequency with which it appears in a particular document set (an integer). Whenever a new document gets added to the set, I parse out the words, and then I need to either insert new terms, and their frequency, into the terms table, or I need to update the frequency of an existing term.
The easiest way to do this would be to do a find, then if it's empty do an insert, or if it's not empty, increment the frequency of the existing record by the correct amount. That's two queries per word, however, and documents with high word counts will result in a ludicrously long list of queries. Is there a more efficient way to do this?

You can do this really efficiently, actually. Well, if you're not afraid to tweak Rails's default table layout a bit, and if you're not afraid to generate your own raw SQL...
I'm going to assume you're using MySQL for your database (I'm not sure what other DBs support this): you can use INSERT ... ON DUPLICATE KEY UPDATE to do this.
You'll have to tweak your count table to get it to work, though - "on duplicate key" refers to the primary key (or a unique index), and Rails's default id, which is just an arbitrary number, won't help you. You'll need to change your primary key so that it identifies what makes each record unique - in your case, I'd say PRIMARY KEY(word, document_set_id). This might not be supported by Rails by default, but there's at least one plugin, and probably a couple more, if you don't like that one.
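If you'd rather not hunt down a plugin, here's a minimal sketch of that schema change done with raw SQL inside an ordinary migration; the word_counts table and its columns are assumptions that simply mirror the INSERT further down:

class CreateWordCounts < ActiveRecord::Migration
  def up
    # No surrogate id column; the composite key below plays that role.
    create_table :word_counts, :id => false do |t|
      t.string  :word,            :null => false  # may need a :limit to stay under MySQL's index length cap
      t.integer :document_set_id, :null => false
      t.integer :occurences,      :null => false, :default => 0
    end
    execute "ALTER TABLE word_counts ADD PRIMARY KEY (word, document_set_id)"
  end

  def down
    drop_table :word_counts
  end
end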
Once your database is set up, you can build one giant insert statement and throw that at MySQL, letting the "on duplicate key" part of the query take care of the nasty existence-checking stuff for you (NOTE: there are plugins to do batch inserts, too, but I don't know how they work - particularly in regard to "on duplicate key"):
counts = {}

# This is just demo code! Untested, and it'll leave in punctuation...
@document.text.split(' ').each do |word|
  counts[word] ||= 0
  counts[word] += 1
end

values = []
counts.each_pair do |word, count|
  values << ActiveRecord::Base.send(:sanitize_sql_array, [
    '(?, ?, ?)',
    word,
    @document.set_id,
    count
  ])
end

# Massive line - sorry...
ActiveRecord::Base.connection.execute("INSERT INTO word_counts (word, document_set_id, occurences) VALUES #{values.join(', ')} ON DUPLICATE KEY UPDATE occurences = occurences + VALUES(occurences)")
And that'll do it - one SQL query for the entire new document. Should be much faster, half because you're only running a single query, and half because you've sidestepped ActiveRecord's sluggish query building.
Hope that helps!

Related

Is there an efficient means of using ActiveRecord#Pluck for my situation?

I need to insert a lot of data into a new database. Like, a lot of data, so even nanoseconds count in the context of this query. I'm using activerecord-import to bulk-insert into Postgres, but that doesn't really matter for the scope of this question. Here's what I need:
I need an array that looks like this for each record in the existing DB:
[uuid, timestamp, value, unit, different_timestamp]
The issue is that the uuid is stored on the parent object that I'm looping through to get to this object, so #pluck works for each component aside from that. More annoying is that it is stored as an actual uuid, not a string, and needs to be stored as a uuid (not a string) in the new db as well. I'm not sure but I think using a SELECT inside of #pluck will return a string.
But perhaps the bigger issue is that I need to perform a conversion on the value of value before it is inserted again. It's a simple conversion, in effect just value / 28 or something, but I'm finding it hard to work that into #pluck without also tacking on #each_with_object or something (which slows this down considerably).
Here's the query as it is right now. It seems really silly to me to load the entire record based on the blockage outlined above. I hope there's an alternative.
Klass.find_each do |klass|
  Data.where(token: klass.token).find_each do |data|
    data << [
      klass.uuid,
      data.added_at,
      data.value / conversion,
      data.unit,
      data.created_at
    ]
  end
end
And no, the parent and Data are not associated right now and it's not an option, so I can't eager-load or just call Klass.data (they will be linked after this transition).
So ideally this is what I'm looking for:
Data.where(token: klass.token).pluck(:added_at, :value, :unit, :created_at)
But with the parameters outlined above.
I wonder if you can combine a SQL JOIN with pluck:
Klass
  .joins('INNER JOIN datas ON datas.token = klasses.token')
  .pluck('klasses.uuid', 'datas.added_at', "datas.value / #{conversion.to_f}", 'datas.unit', 'datas.created_at')
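If the plucked rows then need to go into the new table via activerecord-import (as mentioned in the question), something along these lines should work; NewData and its column list are hypothetical stand-ins for the target model:

columns = [:uuid, :added_at, :value, :unit, :created_at]
rows = Klass
  .joins('INNER JOIN datas ON datas.token = klasses.token')
  .pluck('klasses.uuid', 'datas.added_at', "datas.value / #{conversion.to_f}", 'datas.unit', 'datas.created_at')

# Import in slices so each generated INSERT stays a reasonable size.
rows.each_slice(2000) { |slice| NewData.import(columns, slice, validate: false) }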

Multi-column index vs separate indexes vs partial indexes

While working on my Rails app today, I noticed that the paranoia gem says that indexes should be updated to add deleted_at IS NOT NULL as a where clause on the index creation (github link). But it occurred to me that the inverted condition, when I do want with_deleted, won't benefit from the index.
This makes me wonder...
I know that this is somewhat obtuse because the answer is obviously "it depends on what you need" but I am trying to get an idea of the differences between Multi-column index vs separate indexes vs partial indexes on my web app backed by PostgreSQL.
Basically, I have 2 fields that I am querying on: p_id and deleted_at. Most of the time I am querying WHERE p_id=1 AND deleted_at IS NOT NULL - but sometimes I only query WHERE p_id=1. Very seldom, I will query WHERE p_id=1 AND deleted_at=1/1/2017.
So, Am I better off:
Having an index on p_id and a separate index on deleted_at?
Having an index on p_id but add 'where deleted_at IS NOT NULL'?
Having a combined index on p_id and deleted_at together?
Note: perhaps I should mention that p_id is currently a foreign key reference to p.id. Which reminds me, in Postgres, is it necessary for foreign keys to also have indexes (or do they get an index derived from being a foreign key constraint - I've read conflicting answers on this)?
The answer depends on:
how often you use each of these queries, and how long they are allowed to run;
whether query speed is important enough that slower data modifications can be tolerated.
The perfect indexes for the three clauses are:
1. WHERE p_id=1 AND deleted_at IS NOT NULL
   CREATE INDEX ON mytable (p_id) WHERE deleted_at IS NOT NULL;
2. WHERE p_id=1 AND deleted_at=1/1/2017
   CREATE INDEX ON mytable (p_id, deleted_at);
3. WHERE p_id=1
   CREATE INDEX ON mytable (p_id);
The index created for 2. can also be used for 3., so if you need to speed up the second query as much as possible and a slightly bigger index doesn't bother you, create only the index from 2. for both queries.
However, the index from 3. will also speed up the query in 2., just not as much as possible, so if you can live with a slightly worse performance for the query in 2. and want the index as small and efficient as possible for the query in 3., create only the index in 3.
I would not create both the indexes from 2. and 3.; you should pick whichever is best for you.
The case with 1. is different, because that index can only be used for the first query. Create that index only if you want to speed up that query as much as possible, and it doesn't matter if data modifications on the table take longer, because an additional index has to be maintained.
Another indication to create the index in 1. is if only a small percentage of rows satisfies deleted_at IS NOT NULL. If not, the index in 1. doesn't have a great advantage over the one in 3., and you should just create the latter.
Having two separate indexes on the two columns is probably not the best choice – they can be used in combination only with a bitmap index scan, and it may well be that PostgreSQL only chooses to use one of the indexes (depends on the distribution, but probably the one on p_id), and the other one is useless.
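Since this is a Rails app, here's a hedged sketch of option 1 (the partial index) as a migration; Rails 4+ supports PostgreSQL partial indexes through add_index's :where option, the [5.0] version tag assumes Rails 5, and the table and index names are illustrative:

class AddPartialIndexOnPId < ActiveRecord::Migration[5.0]
  def change
    # Matches queries of the form WHERE p_id = ? AND deleted_at IS NOT NULL
    add_index :mytable, :p_id,
              where: 'deleted_at IS NOT NULL',
              name:  'index_mytable_on_p_id_deleted'
  end
end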

How should you backfill a new table in Rails?

I'm creating a new table that needs to be backfilled with data based on User accounts (over a couple dozen thousand) with the following one-time rake task.
What I've decided to do is create a big INSERT string for every 2000 users and execute that query.
Here's what the code roughly looks like:
task :backfill_my_new_table => :environment do
  inserts = []
  User.find_each do |user|
    tuple = # form the tuple based on user and user associations, like (1, 'foo', 'bar', NULL)
    inserts << tuple
  end

  # At this point, the inserts array is of size at least 20,000
  conn = ActiveRecord::Base.connection
  inserts.each_slice(2000) do |slice|
    sql = "INSERT INTO my_new_table (ref_id, column_a, column_b, column_c) VALUES #{slice.join(', ')}"
    conn.execute(sql)
  end
end
So I'm wondering, is there a better way to do this? What are some drawbacks of the approach I took? How should I improve it? What if I didn't slice the inserts array and simply executed a single INSERT with over a couple dozen thousand VALUES tuples? What are the drawbacks of that method?
Thanks!
It depends on which PG version you are using, but in most cases of bulk-loading data into a table this checklist is enough:
try to use COPY instead of INSERT whenever possible (see the sketch after this list);
if using multiple INSERTs, disable autocommit and wrap all the INSERTs in a single transaction, i.e. BEGIN; INSERT ...; INSERT ...; COMMIT;
disable indexes and checks/constraints on the target table;
disable table triggers;
alter the table so it becomes unlogged (since PG 9.5; don't forget to turn logging back on after the data import), or increase max_wal_size so the WAL won't be flooded.
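For instance, a rough sketch of the COPY route using the pg gem's COPY API through ActiveRecord's raw connection; column names follow the question, and note that each tuple would have to be formatted as a CSV line here rather than as a SQL values group:

pg = ActiveRecord::Base.connection.raw_connection
pg.copy_data "COPY my_new_table (ref_id, column_a, column_b, column_c) FROM STDIN WITH (FORMAT csv)" do
  inserts.each do |tuple|
    pg.put_copy_data(tuple + "\n")  # e.g. "1,foo,bar," instead of "(1, 'foo', 'bar', NULL)"
  end
end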
20k rows is not such a big deal for PG, so 2k-sliced inserts within one transaction will be just fine, unless there are some very complex triggers/checks involved. It is also worth reading the PG manual section on bulk loading.
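And a minimal sketch of the "sliced INSERTs within one transaction" advice, reusing the question's variables as-is:

ActiveRecord::Base.connection.transaction do
  inserts.each_slice(2000) do |slice|
    ActiveRecord::Base.connection.execute(
      "INSERT INTO my_new_table (ref_id, column_a, column_b, column_c) VALUES #{slice.join(', ')}"
    )
  end
end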
UPD: and a little bit old, yet wonderful piece from depesz, excerpt:
so, if you want to insert data as fast as possible – use copy (or better yet – pgbulkload). if for whatever reason you can't use copy, then use multi-row inserts (new in 8.2!). then if you can, bundle them in transactions, and use prepared transactions, but generally – they don't give you much.

iterating through table in Ruby using hash runs slow

I have the following code:
h2.each do |k, v|
  @count += 1
  puts @count
  sq.each do |word|
    if Wordsdoc.find_by_docid(k).tf.include?(word)
      sum += Wordsdoc.find_by_docid(k).tf[word] * @s[word]
    end
  end
  rec_hash[k] = sum
  sum = 0
end
h2 -> is a hash that contains ids of documents; the hash contains more than 1,000 of these.
Wordsdoc -> is a model/table in my database.
sq -> is a hash that contains around 10 words.
What I'm doing is going through each of the document ids, and then for each word in sq I look up in the Wordsdoc table whether the word exists (Wordsdoc.find_by_docid(k).tf.include?(word); here tf is a hash of {word => value}),
and if it does I get the value of that word from Wordsdoc and multiply it by the value of the word in @s, which is also a hash of {word => value}.
This seems to be running very slowly. It processes one document per second. Is there a way to process this faster?
Thanks, I really appreciate your help on this!
You do a lot of duplicate querying. While ActiveRecord can do some caching in the background to speed things up, there is a limit to what it can do, and there is no reason to make things harder for it.
The most obvious cause of the slowdown is Wordsdoc.find_by_docid(k). For each value of k, you call it 10 times, and each time you call it there is a possibility of calling it again. That means you call that method with the same argument 10-20 times for each entry in h2. Queries to the database are expensive, since the database is on the hard disk, and accessing the hard disk is expensive in any system. You can just as easily call Wordsdoc.find_by_docid(k) once, before you enter the sq.each loop, and store it in a variable - that would save a lot of querying and make your loop go much faster.
Another optimization - though not nearly as important as the first one - is to get all the Wordsdoc records in a single query. Almost all mid to high level (and some of the low level, too!) programming languages and libraries work better and faster when they work in bulk, and ActiveRecord is no exception. If you can query for all entries of Wordsdoc, and filter them by the docids in h2's keys, you can turn 1,000 queries (after the first optimization; before the first optimization it was 10,000-20,000 queries) into a single, huge query. That will enable ActiveRecord and the underlying database to retrieve your data in bigger chunks, and save you a lot of disk access.
There are some more minor optimizations you can do, but the two I've specified should be more than enough.
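Putting both suggestions together, a rough, untested sketch; it assumes a Rails version with where and ActiveSupport's index_by, and reuses the question's h2, sq, @s and rec_hash:

# One query for every document id in h2, then in-memory lookups only.
docs_by_id = Wordsdoc.where(:docid => h2.keys).index_by(&:docid)

h2.each_key do |k|
  doc = docs_by_id[k]
  sum = 0
  sq.each do |word|
    sum += doc.tf[word] * @s[word] if doc.tf.include?(word)
  end
  rec_hash[k] = sum
end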
You're calling Wordsdoc.find_by_docid(k) twice.
You could refactor the code to:
wordsdoc = Wordsdoc.find_by_docid(k)
if wordsdoc.tf.include?(word)
  sum += wordsdoc.tf[word] * @s[word]
end
...but still it will be ugly and inefficient.
You should prefetch all records in batches, see: https://makandracards.com/makandra/1181-use-find_in_batches-to-process-many-records-without-tearing-down-the-server
For example something like that should be much more efficient:
Wordsdoc.find_in_batches(:conditions => {:docid => array_of_doc_ids}) do |batch|
  batch.each do |wordsdoc|
    if wordsdoc.tf.include?(word)
      sum += wordsdoc.tf[word] * @s[word]
    end
  end
end
Also, you can retrieve only certain columns from the Wordsdoc table by using, for example, :select => :tf in the find_in_batches method.
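For example (hedged: this assumes a Rails version whose find_in_batches still forwards :select and :conditions to the underlying finder):

Wordsdoc.find_in_batches(:select => 'docid, tf',
                         :conditions => {:docid => array_of_doc_ids}) do |batch|
  # Each record in batch now carries only the docid and tf columns.
end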
As you have a lot going on, I'm just going to offer up a few things to check out.
A book called Eloquent Ruby deals with Documents and iterating through documents to count the number of times a word was used. All his examples are about a Document system he was maintaining and so it could even tackle other problems for you.
inject is a method that could speed up what you're looking to do for the sum part, maybe.
Delayed Job the whole thing if you are doing this asynchronously. Meaning, if this is a web app, you must be timing out if you're waiting 1,000 seconds for this job to complete before it shows its answers on the screen.
Go get em.

Can one rely on the auto-incrementing primary key in your database?

In my present Rails application, I am resolving scheduling conflicts by sorting the models by the "created_at" field. However, I realized that when inserting multiple models from a form that allows this, all of the created_at times are exactly the same!
This is more a question of best programming practices: Can your application rely on your ID column in your database to increment greater and greater with each INSERT to get their order of creation? To put it another way, can I sort a group of rows I pull out of my database by their ID column and be assured this is an accurate sort based on creation order? And is this a good practice in my application?
The generated identification numbers will be unique.
Regardless of whether you use Sequences, like in PostgreSQL and Oracle or if you use another mechanism like auto-increment of MySQL.
However, sequences are most often acquired in blocks of, for example, 20 numbers.
So with PostgreSQL you cannot determine which row was inserted first, and there might even be gaps in the ids of inserted records.
Therefore you shouldn't use a generated id field for a task like that, in order to not rely on database implementation details.
Generating a created or updated field when the command executes is much better for sorting by creation or update time later on.
For example:
INSERT INTO A (data, created) VALUES (something, DATE())
UPDATE A SET data=something, updated=DATE()
That depends on your database vendor.
MySQL, I believe, absolutely orders auto-increment keys. SQL Server I don't know for sure, but I believe it does as well.
Where you'll run into problems is with databases that don't support this functionality, most notably Oracle that uses sequences, which are roughly but not absolutely ordered.
An alternative might be to go for created time and then ID.
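For instance, in a Rails 3+ app that might look like the line below; Booking is just a hypothetical model name:

# Sort by creation time, with id as a tiebreaker for rows created in the same instant.
Booking.order('created_at, id')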
I believe the answer to your question is yes... if I read between the lines, I think you are concerned that the system may re-use ID numbers that are 'missing' in the sequence. If you had used 1, 2, 3, 5, 6, 7 as ID numbers, then in all the implementations I know of, the next ID number will always be 8 (or possibly higher), and I don't know of any DB that would try to figure out that record ID #4 is missing and attempt to re-use that ID number.
Though I am most familiar with SQL Server, I don't know of any vendor that would try to fill the gaps in a sequence: think of the overhead of keeping a list of unused IDs, as opposed to just keeping track of the last ID number used and adding 1.
I'd say you can safely rely on the next assigned ID always being higher than the last - not just unique.
Yes the id will be unique and no, you can not and should not rely on it for sorting - it is there to guarantee row uniqueness only. The best approach is, as emktas indicated, to use a separate "updated" or "created" field for just this information.
For setting the creation time, you can just use a default value like this
CREATE TABLE foo (
  id INTEGER UNSIGNED AUTO_INCREMENT NOT NULL,
  created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated TIMESTAMP NULL,
  PRIMARY KEY (id)
) engine=InnoDB; ## whatever :P
Now, that takes care of creation time. For update time I would suggest a BEFORE UPDATE trigger like this one (MySQL only lets you modify NEW values in BEFORE triggers). Of course you can do it in a separate query, but the trigger, in my opinion, is a better solution - more transparent:
DELIMITER $$
CREATE TRIGGER foo_b_upd BEFORE UPDATE ON foo
FOR EACH ROW BEGIN
  SET NEW.updated = NOW();
END$$
DELIMITER ;
And that should do it.
EDIT:
Woe is me. Foolishly I hadn't specified that this is for MySQL; there might be some differences in the function names (namely NOW) and other subtle itty-bitty details.
One caveat to EJB's answer:
SQL does not give any guarantee of ordering if you don't specify an ORDER BY column. E.g. if you delete some early rows and then insert new ones, the new rows may end up living in the same place in the db that the old ones did (albeit with new IDs), and that's what it may use as its default sort.
FWIW, I typically use order by ID as an effective version of order by created_at. It's cheaper in that it doesn't require adding an index to a datetime field (which is bigger and therefore slower than a simple integer primary key index), guaranteed to be different, and I don't really care if a few rows that were added at about the same time sort in some slightly different order.
This is probably DB engine depended. I would check how your DB implements sequences and if there are no documented problems then I would decide to rely on ID.
E.g. Postgresql sequence is OK unless you play with the sequence cache parameters.
There is a possibility that another programmer will manually create or copy records from a different DB with a wrong ID column. However, I would simplify the problem: do not bother with low-probability cases where someone manually destroys data integrity. You cannot protect against everything.
My advice is to rely on sequence generated IDs and move your project forward.
In theory, yes, the highest id number is the last created. Remember, though, that databases have the ability to temporarily turn off the insert of the autogenerated value, insert some records manually, and then turn it back on. Such inserts are not typically used on a production system but can happen occasionally when moving a large chunk of data from another system.
