I'm using Rails with the oracle_enhanced adapter to create a new interface for a legacy application.
Database migrations work successfully, but take an incredibly long time before rake finishes. The database changes themselves happen quickly (1 or 2 seconds), but the db/schema.rb dump takes over an hour to complete. (See example migration below.)
It's a relatively large schema (about 150 tables), but I'm sure it shouldn't be taking this long to dump out each table description.
Is there any way to speed this up by just taking the last schema.rb and applying the change specified in the migration to it? Or can I skip this schema dump altogether?
I understand this schema.rb is used to create the test database from scratch each time, but in this case a large chunk of the database logic lives in table triggers, which aren't included in schema.rb anyway, so the rake tests are no good to us in any case. (That's a whole different issue that I need to sort out at some other point.)
dgs@dgs-laptop:~/rails/voyager$ time rake db:migrate
(in /home/dgs/rails/voyager)
== 20090227012452 AddModuleActionAndControllerNames: migrating ================
-- add_column(:modules, :action_name, :text)
-> 0.9619s
-> 0 rows
-- add_column(:modules, :controller_name, :text)
-> 0.1680s
-> 0 rows
== 20090227012452 AddModuleActionAndControllerNames: migrated (1.1304s) =======
real 87m12.961s
user 0m12.949s
sys 0m2.128s
After all migrations are applied to the database, rake db:migrate calls the db:schema:dump task to generate the schema.rb file from the current database schema.
db:schema:dump calls the adapter's "tables" method to get the list of all tables, then for each table calls the "indexes" and "columns" methods. You can find the SQL SELECT statements used in these methods in the activerecord-oracle_enhanced-adapter gem's oracle_enhanced_adapter.rb file. Basically it selects from ALL% or USER% data dictionary tables to find all the information.
Initially I had issues with the original Oracle adapter when I used it with databases with a lot of different schemas (performance can be affected by the total number of tables in the database, not just in your schema), and therefore I did some optimizations in the Oracle enhanced adapter. It would be good to find out which methods are slow in your case (I suspect it is either the "indexes" or the "columns" method, which is executed for each table).
One way to debug this issue would be to put some debug messages in the oracle_enhanced_adapter.rb file so that you can identify which method calls are taking so long.
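If you don't want to edit the gem file directly, a quick way to see which method is slow is to wrap the schema-dump helpers with timing output. A rough sketch (assuming a Ruby version with Module#prepend and that the adapter class is ActiveRecord::ConnectionAdapters::OracleEnhancedAdapter; on older Rubies you would use alias_method instead):

require 'benchmark'

module OracleSchemaDumpTiming
  %w[tables indexes columns].each do |meth|
    define_method(meth) do |*args|
      result = nil
      seconds = Benchmark.realtime { result = super(*args) }
      Rails.logger.info "#{meth}(#{args.first.inspect}) took #{format('%.2f', seconds)}s"
      result
    end
  end
end

ActiveRecord::ConnectionAdapters::OracleEnhancedAdapter.prepend(OracleSchemaDumpTiming)

With that in an initializer, running rake db:migrate will log one line per call, so you can see whether "indexes" or "columns" dominates the hour.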
Problem mostly solved after some digging around in oracle_enhanced_adapter.rb.
The problem came down to way too many tables in the local schema (many EBA_%, EVT_%, EMP_%, SMP_% tables had been created there coincidentally at some point), archive tables being included in the dump, and a select from the data dictionaries taking 14 seconds to execute.
To fix the speed, I did three things:
Dropped all unneeded tables (about 250 out of 500)
Excluded archive tables from the schema dump
Cached the result of the long running query
This improved the time for the migration/schema dump of the remaining 350 tables from about 90 minutes to about 15 seconds. More than fast enough.
My code is as follows (for inspiration, not for copying and pasting - this code is fairly specific to my database, but you should be able to get the idea). You need to create the temp table manually. It takes about 2 or 3 minutes for me to do - still too long to generate with each migration, and it's fairly static anyway =)
module ActiveRecord
  module ConnectionAdapters
    class OracleEnhancedAdapter
      def tables(name = nil)
        select_all("select lower(table_name) from all_tables where owner = sys_context('userenv','session_user') and table_name not like 'A!_%' escape '!' ").inject([]) do |tabs, t|
          tabs << t.to_a.first.last
        end
      end

      # TODO think of some way to automatically create the rails_temp_index table
      #
      # Table created by:
      #   create table rails_temp_index_table as
      #   SELECT lower(i.index_name) as index_name, i.uniqueness,
      #          lower(c.column_name) as column_name, i.table_name
      #   FROM all_indexes i, user_ind_columns c
      #   WHERE c.index_name = i.index_name
      #     AND i.owner = sys_context('userenv','session_user')
      #     AND NOT exists (SELECT uc.index_name FROM user_constraints uc
      #                     WHERE uc.constraint_type = 'P' and uc.index_name = i.index_name);
      def indexes(table_name, name = nil) #:nodoc:
        result = select_all(<<-SQL, name)
          SELECT index_name, uniqueness, column_name
          FROM rails_temp_index_table
          WHERE table_name = '#{table_name.to_s.upcase}'
          ORDER BY index_name
        SQL

        current_index = nil
        indexes = []

        result.each do |row|
          if current_index != row['index_name']
            indexes << IndexDefinition.new(table_name, row['index_name'], row['uniqueness'] == "UNIQUE", [])
            current_index = row['index_name']
          end
          indexes.last.columns << row['column_name']
        end

        indexes
      end
    end
  end
end
Related
I have an Order model with a datetime column start and integer columns arriving_dur, drop_off_dur, etc., which are durations in seconds from start.
Then in my model I have
class Order < ApplicationRecord
  def finish_time
    self.start + self.arriving_duration + self.drop_off_duration
  end

  # other def something_time ... end
end
I want to be able to do this:
Order.where(finish_time: Time.now..(Time.now+2.hours) )
But of course I can't, because there's no such column as finish_time. How can I achieve such a result?
I've read 4 possible solutions on SO:
eager load all orders and select them with a filter - that would not work well once there are more orders
have a parametrized scope for each time I need - but that means so much code duplication
have an SQL function for each time and bind it to the model with select() - it's just pain
somehow use http://api.rubyonrails.org/classes/ActiveRecord/Attributes/ClassMethods.html#method-i-attribute ? But I have no idea how to use it for my case or whether it even solves the problem I have.
Do you have any idea or some 'best practice' how to solve this?
Thanks!
You have different options to implement this behaviour.
Add an additional finish_time column and update it whenever you create/update your time values. This can be done in Rails (with either before_validation or after_save callbacks) or with PostgreSQL triggers.
class Order < ApplicationRecord
  before_validation :update_finish_time

  private

  def update_finish_time
    self.finish_time = start_time + arriving_duration.seconds + drop_off_duration.seconds
  end
end
This is especially useful when you need finish_time in many places throughout your app. It has the downside that you need to manage that column with extra code and it stores data you actually already have. The upside is that you can easily create an index on that column should you ever have many orders and need to search on it.
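For reference, the extra column would come from an ordinary migration, something like this sketch (table, column and index names are illustrative; adjust the migration version to your Rails version):

class AddFinishTimeToOrders < ActiveRecord::Migration[5.2]
  def change
    add_column :orders, :finish_time, :datetime
    add_index  :orders, :finish_time # makes range queries on finish_time cheap
  end
end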
An option could be to implement the finish-time update as a PostgreSQL trigger instead of in Rails. This has the benefit of being independent of your Rails application (e.g. when other sources/scripts access your DB too) but the downside of splitting your business logic across several places (Ruby code, Postgres code).
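For what it's worth, such a trigger migration could look roughly like this (a sketch only, reusing the column names from the callback example above; not tested against your schema):

class AddFinishTimeTriggerToOrders < ActiveRecord::Migration[5.2]
  def up
    execute <<-SQL
      CREATE OR REPLACE FUNCTION set_order_finish_time() RETURNS trigger AS $$
      BEGIN
        NEW.finish_time := NEW.start_time +
          (NEW.arriving_duration + NEW.drop_off_duration) * interval '1 second';
        RETURN NEW;
      END;
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER orders_set_finish_time
        BEFORE INSERT OR UPDATE ON orders
        FOR EACH ROW EXECUTE PROCEDURE set_order_finish_time();
    SQL
  end

  def down
    execute "DROP TRIGGER orders_set_finish_time ON orders"
    execute "DROP FUNCTION set_order_finish_time()"
  end
end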
Your second option is adding a virtual column just for your query.
def orders_within_the_next_2_hours
finishing_orders = Order.select("*, (start_time + (arriving_duration + drop_off_duration) * interval '1 second') AS finish_time")
Order.from("(#{finishing_orders.to_sql}) AS orders").where(finish_time: Time.now..(Time.now+2.hours) )
end
The code above builds the SQL query for finishing_orders, which is the orders table with the additional finish_time column. In the second line we use that finishing_orders SQL as the FROM clause ("cleverly" aliased to orders so Rails is happy). This way we can query finish_time as if it were a normal column.
The SQL is written for relatively old PostgreSQL versions (I guess it works for 9.3+). If you use make_interval instead of multiplying by interval '1 second', the SQL might be a little more readable (but it needs a newer PostgreSQL version, 9.4+ I think).
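For comparison, the make_interval variant of the select could look roughly like this (untested sketch):

finishing_orders = Order.select(
  "*, (start_time + make_interval(secs => arriving_duration + drop_off_duration)) AS finish_time"
)
Order.from("(#{finishing_orders.to_sql}) AS orders")
     .where(finish_time: Time.now..(Time.now + 2.hours))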
I noticed that Rails can have concurrency issues with multiple servers and would like to force my model to always lock. Is this possible in Rails, similar to unique constraints to force data integrity? Or does it just require careful programming?
Terminal One
irb(main):033:0* Vote.transaction do
irb(main):034:1* v = Vote.lock.first
irb(main):035:1> v.vote += 1
irb(main):036:1> sleep 60
irb(main):037:1> v.save
irb(main):038:1> end
Terminal Two, while sleeping
irb(main):240:0* Vote.transaction do
irb(main):241:1* v = Vote.first
irb(main):242:1> v.vote += 1
irb(main):243:1> v.save
irb(main):244:1> end
DB Start
select * from votes where id = 1;
id | vote | created_at | updated_at
----+------+----------------------------+----------------------------
1 | 0 | 2013-09-30 02:29:28.740377 | 2013-12-28 20:42:58.875973
After execution
Terminal One
irb(main):040:0> v.vote
=> 1
Terminal Two
irb(main):245:0> v.vote
=> 1
DB End
select * from votes where id = 1;
id | vote | created_at | updated_at
----+------+----------------------------+----------------------------
1 | 1 | 2013-09-30 02:29:28.740377 | 2013-12-28 20:44:10.276601
Other Example
http://rhnh.net/2010/06/30/acts-as-list-will-break-in-production
You are correct that transactions by themselves don't protect against many common concurrency scenarios, incrementing a counter being one of them. There isn't a general way to force a lock; you have to ensure you use it everywhere necessary in your code.
For the simple counter incrementing scenario there are two mechanisms that will work well:
Row Locking
Row locking will work as long as you do it everywhere in your code where it matters. Knowing where it matters may take some experience to get an instinct for :/. If, as in your above code, you have two places where a resource needs concurrency protection and you only lock in one, you will have concurrency issues.
You want to use the with_lock form; this does a transaction and a row-level lock (table locks are obviously going to scale much more poorly than row locks, although for tables with few rows there is no difference, as PostgreSQL (not sure about MySQL) will use a table lock anyway). It looks like this:
v = Vote.first
v.with_lock do
  v.vote += 1
  sleep 10
  v.save
end
The with_lock creates a transaction, locks the row the object represents, and reloads the object's attributes, all in one step, minimizing the opportunity for bugs in your code. However, this does not necessarily help you with concurrency issues involving the interaction of multiple objects. It can work if a) all possible interactions depend on one object and you always lock that object, and b) the other objects each only interact with one instance of that object, e.g. locking a user row and doing stuff with objects which all belong_to (possibly indirectly) that user object.
Serializable Transactions
The other possibility is to use serializable transactions. Since 9.1, PostgreSQL has had "real" serializable transactions. This can perform much better than locking rows (though it is unlikely to matter in the simple counter-incrementing use case).
The best way to understand what serializable transactions give you is this: if you take all the possible orderings of all the (isolation: :serializable) transactions in your app, what happens when your app is running is guaranteed to always correspond with one of those orderings. With ordinary transactions this is not guaranteed to be true.
However, what you have to do in exchange is to take care of what happens when a transaction fails because the database is unable to guarantee that it was serializable. In the case of the counter increment, all we need to do is retry:
begin
  Vote.transaction(isolation: :serializable) do
    v = Vote.first
    v.vote += 1
    sleep 10 # this is to simulate concurrency
    v.save
  end
rescue ActiveRecord::StatementInvalid => e
  sleep rand / 100 # this is NECESSARY in scalable real-world code,
                   # although the amount of sleep is something you can tune.
  retry
end
Note the random sleep before the retry. This is necessary because failed serializable transactions have a non-trivial cost, so if we don't sleep, multiple processes contending for the same resource can swamp the db. In a heavily concurrent app you may need to gradually increase the sleep with each retry. The random is VERY important to avoid harmonic deadlocks -- if all the processes sleep the same amount of time they can get into a rhythm with each other, where they all are sleeping and the system is idle and then they all try for the lock at the same time and the system deadlocks causing all but one to sleep again.
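One way to combine a growing sleep with the jitter, as a sketch (the numbers are illustrative, not tuned):

attempts = 0
begin
  Vote.transaction(isolation: :serializable) do
    v = Vote.first
    v.vote += 1
    v.save
  end
rescue ActiveRecord::StatementInvalid
  attempts += 1
  raise if attempts > 5 # give up after a few tries
  sleep((rand + 0.1) * (2**attempts) / 100.0) # exponential backoff plus random jitter
  retry
end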
When the transaction that needs to be serializable involves interaction with a source of concurrency other than the database, you may still have to use row-level locks to accomplish what you need. An example of this would be when a state machine transition determines what state to transition to based on a query to something other than the db, like a third-party API. In this case you need to lock the row representing the object with the state machine while the third party API is queried. You cannot nest transactions inside serializable transactions, so you would have to use object.lock! instead of with_lock.
Another thing to be aware of is that any objects fetched outside the transaction(isolation: :serializable) should have reload called on them before use inside the transaction.
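A sketch of the external-API case described above (the model name and the fetch_next_state_from_api helper are hypothetical):

record = StateMachineModel.find(some_id) # fetched outside the transaction

StateMachineModel.transaction(isolation: :serializable) do
  record.lock! # re-reads the row with FOR UPDATE instead of opening a nested with_lock transaction
  next_state = fetch_next_state_from_api(record) # the third-party call happens while the row is locked
  record.state = next_state
  record.save!
end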
ActiveRecord always wraps save operations in a transaction.
For your simple case it might be best to just use a SQL update instead of performing logic in Ruby and then saving. Here is an example which adds a model method to do this:
class Vote
  def vote!
    self.class.update_all("vote = vote + 1", {:id => id})
  end
end
This method avoids the need for locking in your example. If you need more general database locking, see David's suggestion.
You can do the following in your model:
class Vote < ActiveRecord::Base
  validate :handle_conflict, on: :update

  attr_accessible :original_updated_at
  attr_writer :original_updated_at

  def original_updated_at
    @original_updated_at || updated_at
  end

  def handle_conflict
    # If we want to use this across multiple models
    # then extract this to a module
    if @conflict || updated_at.to_f > original_updated_at.to_f
      @conflict = true
      @original_updated_at = nil
      # If two updates are made at the same time a validation error
      # is displayed and the changed fields are listed
      errors.add :base, 'This record changed while you were editing'
      changes.each do |attribute, values|
        errors.add attribute, "was #{values.first}"
      end
    end
  end
end
original_updated_at is a virtual attribute that gets set from the form. handle_conflict is fired when the record is updated; it checks whether the updated_at attribute in the database is later than the hidden one (defined on your page). By the way, you should define the following in your app/views/votes/_form.html.erb:
<%= f.hidden_field :original_updated_at %>
If there is a conflict, the validation error is raised.
And if you are using Rails 4 you won't have attr_accessible and will need to add :original_updated_at to your vote_params method in your controller.
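For example, a strong-parameters version might look like this (assuming your controller already defines vote_params):

def vote_params
  params.require(:vote).permit(:vote, :original_updated_at)
end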
Hopefully this sheds some light.
For simple +1
Vote.increment_counter :vote, Vote.first.id
Because vote was used both for the table name and the field, this is how the two are used:
TableName.increment_counter :field_name, id_of_the_row
Using rails 3 and mongoDB with the mongoid adapter, how can I batch finds to the mongo DB? I need to grab all the records in a particular mongo DB collection and index them in solr (initial index of data for searching).
The problem I'm having is that doing Model.all grabs all the records and stores them into memory. Then when I process over them and index in solr, my memory gets eaten up and the process dies.
What I'm trying to do is batch the find in mongo so that I can iterate over 1,000 records at a time, pass them to solr to index, and then process the next 1,000, etc...
The code I currently have does this:
Model.all.each do |r|
Sunspot.index(r)
end
For a collection that has about 1.5 million records, this eats up 8+ GB of memory and kills the process. In ActiveRecord, there is a find_in_batches method that allows me to chunk up the queries into manageable batches that keeps the memory from getting out of control. However, I can't seem to find anything like this for mongoDB/mongoid.
I would LIKE to be able to do something like this:
Model.all.in_batches_of(1000) do |batch|
Sunspot.index(batch)
end
That would alleviate my memory problems and query difficulties by only doing a manageable problem set each time. The documentation is sparse, however, on doing batch finds in mongoDB. I see lots of documentation on doing batch inserts but not batch finds.
With Mongoid, you don't need to manually batch the query.
In Mongoid, Model.all returns a Mongoid::Criteria instance. Upon calling #each on this Criteria, a Mongo driver cursor is instantiated and used to iterate over the records. This underlying Mongo driver cursor already batches all records. By default the batch_size is 100.
For more information on this topic, read this comment from the Mongoid author and maintainer.
In summary, you can just do this:
Model.all.each do |r|
Sunspot.index(r)
end
If you are iterating over a collection where each record requires a lot of processing (e.g. querying an external API for each item), it is possible for the cursor to time out. In this case you need to perform multiple queries in order not to leave the cursor open.
require 'mongoid'

module Mongoid
  class Criteria
    def in_batches_of(count = 100)
      Enumerator.new do |y|
        total = 0

        loop do
          batch = 0

          self.limit(count).skip(total).each do |item|
            total += 1
            batch += 1
            y << item
          end

          break if batch == 0
        end
      end
    end
  end
end
The code above is a helper method that adds the batching functionality. It can be used like so:
Post.all.order_by(:id => 1).in_batches_of(7).each_with_index do |post, index|
# call external slow API
end
Just make sure you ALWAYS have an order_by on your query, otherwise the paging might not do what you want it to. Also, I would stick with batches of 100 or less: as said in the accepted answer, Mongoid queries in batches of 100, and you never want to leave the cursor open for long while doing the processing.
It is faster to send batches to sunspot as well.
This is how I do it:
records = []

Model.batch_size(1000).no_timeout.only(:your_text_field, :_id).all.each do |r|
  records << r
  if records.size > 1000
    Sunspot.index! records
    records.clear
  end
end

Sunspot.index! records
no_timeout: prevents the cursor from disconnecting (after 10 minutes, by default)
only: selects only the id and the fields that are actually indexed
batch_size: fetches 1000 entries at a time instead of 100
I am not sure about batch processing, but you can do it this way:
current_page = 0
item_count = Model.count

while item_count > 0
  Model.all.skip(current_page * 1000).limit(1000).each do |item|
    Sunspot.index(item)
  end
  item_count -= 1000
  current_page += 1
end
But if you are looking for a perfect long-term solution I wouldn't recommend this. Let me explain how I handled the same scenario in my app. Instead of doing batch jobs, I created a Resque job which updates the Solr index:
class SolrUpdator
  @queue = :solr_updator

  def self.perform(item_id)
    item = Model.find(item_id)
    # I have used RSolr; you can change the code below to handle Sunspot
    solr = RSolr.connect :url => Rails.application.config.solr_path
    js = JSON.parse(item.to_json)
    solr.add js
  end
end
After adding the item, I just put an entry into the Resque queue:
Resque.enqueue(SolrUpdator, item.id.to_s)
That's all; start Resque and it will take care of everything.
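Where you enqueue is up to you; one common hook point (just a sketch) is a Mongoid callback on the model:

class Model
  include Mongoid::Document

  after_save :enqueue_solr_update

  private

  def enqueue_solr_update
    Resque.enqueue(SolrUpdator, id.to_s)
  end
end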
As @RyanMcGeary said, you don't need to worry about batching the query. However, indexing objects one at a time is much, much slower than batching them.
Model.all.to_a.in_groups_of(1000, false) do |records|
  Sunspot.index! records
end
The following will work for you, just try it:
Model.all.in_groups_of(1000, false) do |r|
  Sunspot.index! r
end
This code should update the entire table by applying a filter to its "name" values:
entries = select('id, name').all

entries.each do |entry|
  puts entry.id
  update(entry.id, { :name => sanitize(entry.name) })
end
I am pretty new to Ruby on Rails and found it interesting that my selection query is split into single-row selections:
SELECT `entries`.* FROM `entries` WHERE (`entries`.`id` = 1) LIMIT 1
SELECT `entries`.* FROM `entries` WHERE (`entries`.`id` = 2) LIMIT 1
SELECT `entries`.* FROM `entries` WHERE (`entries`.`id` = 3) LIMIT 1
...
As I understand it, this is a kind of optimization provided by Rails - to select a row only when it's needed (each cycle) and not all entries at once.
However, is it really more efficient in this case? I mean, if I have 1000 records in my database table - is it better to make 1000 queries than a single one? If not, how can I force Rails to select more than one row per query?
Another question: not all rows are actually changed by this operation. Does Rails skip the update query if the provided values are the same as the ones that already exist (in other words, if entry.name == sanitize(entry.name))?
ActiveRecord is an abstraction layer, but when doing certain operations (especially those involving large datasets) it is useful to know what is happening underneath the abstraction layer.
This is pretty much true for all abstractions. (see Joel Spolsky's classic article on leaky abstractions: http://www.joelonsoftware.com/articles/LeakyAbstractions.html )
To deal with the case in point here, Rails provides batch methods such as find_each (and update_all for bulk updates):
Entry.find_each do |entry|
  # ...
end
That fetches the entries in batches (1000 per query by default) and exposes each entry for your pleasure.
If attributes are not changed, Rails will not perform an UPDATE query.
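Putting both points together, a batched version of your loop could look like this (a sketch; sanitize is the helper from your snippet):

Entry.find_each do |entry|
  entry.name = sanitize(entry.name)
  entry.save! if entry.changed? # unchanged rows are skipped, so no UPDATE is issued for them
end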
I have a simple 4-column Excel spreadsheet that matches universities to their ID codes for lookup purposes. The file is pretty big (300k).
I need to come up with a way to turn this data into a populated table in my Rails app. The catch is that this is a document that is updated now and then, so it can't just be a one-time solution. Ideally, it would be some sort of ruby script that would read the file and create the entries automatically so that when we get emailed a new version, we can just update it automatically. I'm on Heroku if that matters at all.
How can I accomplish something like this?
If you can, save the spreadsheet as CSV; there are much better gems for parsing CSV files than for parsing Excel spreadsheets. I found that an effective way of handling this kind of problem is to make a rake task that reads the CSV file and creates all the records as appropriate.
So for example, here's how to read all the lines from a file using the old, but still effective, FasterCSV gem:
data = FasterCSV.read('lib/tasks/data.csv')
columns = data.shift # the first row holds the column names
unique_column_index = -1 # the index of a column that's always unique per row in the spreadsheet

data.each do |row|
  r = Record.find_or_initialize_by_unique_column(row[unique_column_index])
  columns.each_with_index do |column_name, index|
    r[column_name] = row[index]
  end
  begin
    r.save!
  rescue => e
    Rails.logger.error("Failed to save #{r.inspect}: #{e.message}")
  end
end
It does kinda rely on you having a unique column in the original spreadsheet to go off though.
If you put that into a rake task, you can then wire it into your Capistrano deploy script so it'll be run every time you deploy. The find_or_initialize should ensure you don't get duplicate records.
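For reference, the surrounding rake task can be as simple as this skeleton (file and task names are just examples):

# lib/tasks/import.rake
namespace :import do
  desc "Import university records from lib/tasks/data.csv"
  task :universities => :environment do
    # ... the FasterCSV loop from above goes here ...
  end
end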
Parsing newish Excel files isn't too much trouble using Hpricot. This will give you a two-dimensional array:
require 'hpricot'

doc = open("data.xlsx") { |f| Hpricot(f) }
rows = doc.search('row')
rows = rows[1..rows.length] # Skips the header row

rows = rows.map do |row|
  columns = []
  row.search('cell').each do |cell|
    # Excel stores cell indexes rather than blank cells
    next_index = (cell.attributes['ss:Index']) ? (cell.attributes['ss:Index'].to_i - 1) : columns.length
    columns[next_index] = cell.search('data').inner_html
  end
  columns
end