Data integrity when updating/incrementing values in Rails - ruby-on-rails

I am working on an application in which one module accesses a log (an ActiveRecord model) to increment various values at several points during execution. The problem is that the script is likely to take several seconds, and other instances of the same script are likely to run at the same time.
What I'm seeing is that while the script finishes correctly and all data is accounted for, the values are never correct. This is likely because by the time the script gets to the point where it updates the value, the value it read for the column (which is being incremented) is already out of date.
The values are correct if I force it to only have one instance of the module at a time, but for performance reasons I can't keep doing this.
Currently, I've tried to solve the problem by querying the database for the record before each increment statement in a transaction, like this, where column is a symbol naming the column and value is 1 or higher:
Log.transaction do
  log = Log.find(@log_id)
  log.update_attribute(column, log.send(column) + value)
end
However, it still won't give me accurate numbers. I'm thinking caching is involved, and I suppose I could try something like:
Log.transaction do
  Log.uncached do
    log = Log.find(@log_id)
    log.update_attribute(column, log.send(column) + value)
  end
end
But surely I can't be the first to come across this issue, so I am wondering if there is a better implementation/solution for my problem?

ActiveRecord provides an update_counters method. You can change your code as follows:
Log.update_counters @log_id, column.to_sym => value
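update_counters issues a single atomic UPDATE (roughly UPDATE logs SET column = column + value WHERE id = ...), so concurrent increments are applied in the database instead of clobbering each other in Ruby. If you ever do need the read-modify-write form, pessimistic row locking is another option; a minimal sketch, assuming the same column and value variables as above:
Log.transaction do
  # SELECT ... FOR UPDATE: other transactions must wait for this one to commit
  # before they can read (and increment) the same row.
  log = Log.lock.find(@log_id)
  log.update_attribute(column, log.send(column) + value)
end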

Related

.where statement to filter posts by date causing large number of database queries?

My first StackOverflow question, so pardon if it's a little rough around the edges. I recently began my first engineering job and have inherited some legacy Ruby on Rails code to work through.
My goal:
is to fetch posts (this is a model, though with no association to a user) belonging to a user, as seen below. The posts should be filtered to only include those whose end_date is null or in the future.
The problem:
The ActiveRecord query @valid_posts ||= Posts.for_user(user).where('end_date > ? OR end_date IS ?', Time.now.in_time_zone, nil).pluck(:post_set_id) (some further context below)
generates ~15 calls to my database per user per second when testing with Postman, causing significant memory spikes, notably with an increased number of posts. I would only expect (not sure about this) 2 at most (one to fetch posts for the user, a second to fetch posts that match the date constraint).
In the absence of the .where('end_date > ? OR end_date IS ?', Time.now.in_time_zone, nil), there are no memory issues whatsoever. My question, essentially, is why this particular line causes so many queries to the database (which seems to be the cause of the memory spikes), and what would be an improved implementation?
My reasoning thus far:
My initial suspicion was that I was making an N+1 query, though I no longer believe this to be the case (I compared .select with .where in the query; no significant changes). A third option would possibly be to use .includes, though there is no association between a user and a post, and I do not believe it would be feasible to create one since, to my level of understanding, users are a function of an organization, not their own model.
My second thought is that because I am using a time that is precise to the millisecond, the value is ever-changing, and therefore the query runs against the posts table with a new timestamp every time. Would it be possible to capture the current time in a variable and then pass that to the .where statement, rather than a varying time as is currently implemented (a sketch of this follows the list of thoughts below)? This would effectively act as a caching mechanism, if I am not mistaken.
My third thought was to add an index to end_date on the posts table for quicker lookup, though in itself, I do not believe this to provide a solution.
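A sketch of that second thought, capturing the time once (the NULL check is written inline here, which is equivalent to binding nil):
# Capture the comparison time once so the relation is built with a stable value
# instead of a new timestamp on every call.
current_time = Time.now.in_time_zone
@valid_posts ||= Posts.for_user(user)
                      .where('end_date IS NULL OR end_date > ?', current_time)
                      .pluck(:post_set_id)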
Some basic context:
While there are many files working together, I have tried to overly-simplify them to essentially reflect the information that I believe is necessary to understand the issue at hand. If there is no identifiable cause for this issue, then perhaps I need to dig into other areas of code.
for_user is a user scope defined below:
user_scope
module UserScopable
  extend ActiveSupport::Concern
  ...

  scope(:for_user,
        lambda { |poster|
          for_user_scope(
            { user_id: poster.user_id, organization_id: poster.organization_id }
          )
        })

  scope(:for_user_scope, lambda { |hash|
    where(user_id: hash.fetch(:user_id), organization_id: hash.fetch(:organization_id))
  })
@valid_posts is contained within a module, PostSetFilter, and called in the user controller:
users_controller
def post_ids
  post_pools = PostSetFilter.new(user: user)
  render json: { PostPools: post_pools }
end
Ultimately, there's a lot that I do not know, and there seem to be many possible approaches, so I'm not entirely sure how to proceed. Any guidance on how to reduce the number of queries, and any reasoning as to why this happens, would be greatly appreciated.
I am happy to provide further context if needed, though everything points to the aforementioned line as the culprit. Thank you in advance.

Import CSV data faster in rails

I am building an import module to import a large set of orders from a csv file. I have a model called Order where the data needs to be stored.
A simplified version of the Order model is below
sku
quantity
value
customer_email
order_date
status
When importing the data two things have to happen
Any dates or currencies need to be cleaned up, i.e. dates are represented as strings in the CSV and need to be converted into Rails Date objects, and currencies need to be converted to decimals by removing any commas or dollar signs.
If a row already exists it has to be updated; uniqueness is checked based on two columns.
Currently I use a simple csv import code
CSV.foreach("orders.csv") do |row|
order = Order.first_or_initialize(sku: row[0], customer_email: row[3])
order.quantity = row[1]
order.value= parse_currency(row[2])
order.order_date = parse_date(row[4])
order.status = row[5]
order.save!
end
Where parse_currency and parse_date are two functions used to extract the values from strings. In the case of the date it is just a wrapper for Date.strptime.
I can add a check to see if the record already exists and skip it in that case, which should save a little time, but I am looking for something significantly faster. Currently, importing around 100k rows takes around 30 minutes with an empty database, and it will get slower as the data size increases.
So I am basically looking for a faster way to import the data.
Any help would be appreciated.
Edit
After some more testing based on the comments here I have an observation and a question. I am not sure if they should go here or if I need to open a new thread for the questions. So please let me know if I have to move this to a separate question.
I ran a test using Postgres copy to import the data from the file and it took less than a minute. I just imported the data into a new table without any validations. So the import can be much faster.
The Rails overhead seems to be coming from 2 places
The multiple database calls that are happening, i.e. the first_or_initialize for each row. This ends up as multiple SQL calls because it has to first find the record, then update it, and then save it.
Bandwidth. Each time the SQL server is called, the data flows back and forth, which adds up to a lot of time.
Now for my question. How do I move the update/create logic to the database, i.e. if an order already exists based on the sku and customer_email it needs to be updated, otherwise a new record needs to be created? Currently with Rails I am using the first_or_initialize method to get the record if it exists and update it; otherwise I create a new one and save it. How do I do that in SQL?
I could run a raw SQL query using the ActiveRecord connection's execute, but I do not think that would be a very elegant way of doing it. Is there a better way of doing that?
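For context, on PostgreSQL 9.5 or newer this kind of update-or-create can be expressed as a single statement, assuming a unique index exists on (sku, customer_email); a rough sketch with placeholder values:
# Placeholder literals for illustration; real code should bind/sanitize the values.
Order.connection.execute(<<-SQL)
  INSERT INTO orders (sku, customer_email, quantity, value, order_date, status)
  VALUES ('SKU-1', 'buyer@example.com', 2, 19.99, '2016-01-01', 'confirmed')
  ON CONFLICT (sku, customer_email)
  DO UPDATE SET quantity   = EXCLUDED.quantity,
                value      = EXCLUDED.value,
                order_date = EXCLUDED.order_date,
                status     = EXCLUDED.status
SQL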
Since Ruby 1.9, FasterCSV has been part of the Ruby core. You don't need to use a special gem; simply use CSV.
With 100k records, Ruby is taking about 0.018 seconds per record. In my opinion most of your time is spent within Order.first_or_initialize: this part of your code takes an extra round trip to your database, and initializing an ActiveRecord object takes time too. But to really be sure, I would suggest that you benchmark your code.
Benchmark.bm do |x|
  x.report("CSV eval") { CSV.foreach("orders.csv") {} }
  x.report("Init: ") { 1.upto(100_000) { Order.first_or_initialize(sku: rand(...), customer_email: rand(...)) } } # use random values to prevent query caching
  x.report('parse_currency') { 1.upto(100_000) { parse_currency(...) } }
  x.report('parse_date') { 1.upto(100_000) { parse_date(...) } }
end
You should also watch memory consumption during your import. Maybe the garbage collection does not run often enough or objects are not cleaned up.
To gain speed you can follow Matt Brictson's hint and bypass ActiveRecord.
You can try the activerecord-import gem, or you can go parallel, for instance multiprocessing with fork or multithreading with Thread.new.
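For the activerecord-import route, a hedged sketch of batching the rows and letting the database handle the update-or-insert (option names are taken from the gem's documentation; check the version you install and your adapter for exact support):
rows = []
CSV.foreach("orders.csv") do |row|
  rows << [row[0], row[3], row[1], parse_currency(row[2]), parse_date(row[4]), row[5]]
end

columns = [:sku, :customer_email, :quantity, :value, :order_date, :status]
rows.each_slice(1000) do |batch|
  # One multi-row INSERT per batch instead of one round trip per record.
  Order.import columns, batch, validate: false,
    on_duplicate_key_update: { conflict_target: [:sku, :customer_email],
                               columns: [:quantity, :value, :order_date, :status] }
end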

Is it possible in Ruby to set a specific ActiveRecord call to read dirty

I am looking at a rather large database. Let's say I have an exported flag on the product records.
If I want an estimate of how many products I have with the flag set to false, I can do a call something like this:
Product.where(:exported => false).count
The problem I have is that even the count takes a long time, because the table of 1 million products is constantly being written to. More specifically, exports are happening, and the value I'm interested in counting is ever-changing.
So I'd like to do a dirty read on the table... Not a dirty read always. And I 100% don't want all subsequent calls to the database on this connection to be dirty.
But for this one call, dirty is what I'd like.
Oh, I should mention: Ruby 1.9.3, Heroku, and PostgreSQL.
Now.. if I'm missing another way to get the count, I'd be excited to try that.
OH SNOT one last thing.. this example is contrived.
PostgreSQL doesn't support dirty reads.
You might want to use triggers to maintain a materialized view of the count - but doing so will mean that only one transaction at a time can insert a product, because they'll contend for the lock on the product count in the summary table.
Alternately, use system statistics to get a fast approximation.
Or, on PostgreSQL 9.2 and above, ensure there's a primary key (and thus a unique index) and make sure vacuum runs regularly. Then you should be able to do quite a fast count, as PostgreSQL should choose an index-only scan on the primary key.
Note that even if Pg did support dirty reads, the read would still not return perfectly up-to-date results, because rows would sometimes be inserted behind the read pointer in a sequential scan. The only way to get a perfectly up-to-date count is to prevent concurrent inserts: LOCK TABLE thetable IN EXCLUSIVE MODE.
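The system-statistics approximation mentioned above can be read straight from pg_class; a sketch, assuming the table is named products (note this estimates total rows, not the exported = false subset, and is only as fresh as the last ANALYZE):
# Planner's row-count estimate for the whole table; cheap, but approximate.
estimated_rows = ActiveRecord::Base.connection.select_value(
  "SELECT reltuples::bigint FROM pg_class WHERE relname = 'products'"
)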
As soon as a query begins to execute, it runs against a frozen, read-only state, because that's what MVCC is all about. The values are not changing in that snapshot, only in subsequent amendments to that state. It doesn't matter if your query takes an hour to run; it is operating on data that's locked in time.
If your queries are taking a very long time, it sounds like you need an index on your exported column, or whatever values you use in your conditions, as a COUNT against an indexed column is usually very fast.
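If the index is missing, a minimal migration sketch (assuming the table is named products):
class AddIndexToProductsOnExported < ActiveRecord::Migration
  def change
    # Lets the planner satisfy the exported = false condition from the index.
    add_index :products, :exported
  end
end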

Performing multiple queries on the same model efficiently

I've been going round in circles for a few days trying to solve a problem which I've also struggled with in the past. Essentially it's an issue of understanding the best (or an efficient) way to perform multiple queries on a model, as I'm regularly finding my pages are very slow to load.
Consider the situation where you have a model called Everything. Initially you perform a query which finds those records in Everything which match certain criteria
@chosenrecords = Everything.where('name LIKE ?', 'What I want').order('price ASC')
I want to remember the contents of @chosenrecords as I will present them to the user as a list; however, I would also like to know more about the attributes of @chosenrecords, for instance
@minimumprice = @chosenrecords.first
@numberofrecords = @chosenrecords.count
When I use the above code in my controller and inspect the command history on the local server, I am surprised to find that each of the three statements issues an SQL query against the original Everything model, rather than remembering the records returned in @chosenrecords and performing the query on those. This seems very inefficient to me, and indeed each of the three queries takes the same amount of time to process, making the page slow.
I am more experienced in writing code in software like MATLAB, where once you've calculated the value of a variable it is stored locally and can be quickly interrogated, rather than being recalculated each time you want to know more about it. Please could you guide me as to whether I am just on the wrong track completely and the issues I've identified are just "how it is in Rails", or whether there is something I can do to improve it? I've looked into concepts like using a scope, defining a different variable type, and caching, but I'm not quite sure what I'm doing in each case and keep ending up in a similar hole.
Thanks for your time
You are partially on the wrong track. Rails 3 comes with Arel, which defers the query until the data is actually required. In your case, you have built an Arel query but are executing it once with .first and then again with .count. What I have done here is run the query once, load all the results into an array, and work on that array in the next two lines.
Perform the queries like this:-
@chosenrecords = Everything.where('name LIKE ?', 'What I want').order('price ASC').all
@minimumprice = @chosenrecords.first
@numberofrecords = @chosenrecords.size
It will solve your issue.
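One caveat: on Rails 4 and later, .all returns a lazy relation rather than an array, so the equivalent there is to force and cache the query with .load (or .to_a); a sketch:
# .load runs the query once; .first and .size then work on the cached records
# instead of issuing further SQL.
@chosenrecords = Everything.where('name LIKE ?', 'What I want').order('price ASC').load
@minimumprice = @chosenrecords.first
@numberofrecords = @chosenrecords.size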

Display a record sequentially with every refresh

I have a Rails 3 application that currently shows a single "random" record with every refresh; however, it repeats records too often, or never shows a particular record at all. I was wondering what a good way would be to loop through each record and display them so that all get shown before any are repeated. I was thinking of somehow using cookies or session IDs to sequentially loop through the record IDs, but I'm not sure if that would work right, or exactly how to go about it.
The database consists of a single table with a single column, and currently only about 25 entries, but more will be added. IDs are generated automatically and are sequential.
Some suggestions would be appreciated.
Thanks.
The funny thing about 'random' is that it doesn't usually feel random when you get the same answer twice in short succession.
The usual answer to this problem is to generate a queue of responses, and make sure when you add entries to the queue that they aren't already on the queue. This can either be a queue of entries that you will return to the user, or a queue of entries that you have already returned to the user. I like your idea of using the record ids, but with only 25 entries, that repeating loop will also be annoying. :)
You could keep track of the queue of previous entries in memcached if you've already got one deployed, or you could stuff the queue into the session (it'll probably just be five or six integers, so not much data transfer), or the database.
I think I'd avoid the database, because it sure doesn't need to be persistent, it doesn't need to take database bandwidth or compute time, and using the database just to keep track of five or six integers seems silly. :)
UPDATE:
In one of your controllers (maybe ApplicationController), add something like this to a method that you run in a before_filter:
class ApplicationController < ActionController::Base
  before_filter :find_quip

  def find_quip
    last_quip_id = session[:quip_id] || Quips.find(:first).id
    new_quip_id = (Quips.find_by_id(last_quip_id + 1) || Quips.find(:first)).id
    session[:quip_id] = new_quip_id
  end
end
I'm not so happy with the code to wrap around when you run out of quips; it'll completely screw up if there is ever a hole in the sequence. Which is probably going to happen someday. And it will start on number 2. But I'm getting too tired to sort it out. :)
If, as you say, there are never going to be too many, you could store the entire array of IDs as a session variable, with another variable for the current index, and loop through them sequentially, incrementing the index.
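A minimal sketch of that session-array approach (the helper name and session keys are made up for illustration; Quips is borrowed from the snippet above):
def next_quip
  # pluck needs Rails 3.2+; use Quips.order(:id).map(&:id) on older versions.
  session[:quip_ids] ||= Quips.order(:id).pluck(:id)
  session[:quip_index] = ((session[:quip_index] || -1) + 1) % session[:quip_ids].size
  Quips.find(session[:quip_ids][session[:quip_index]])
end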
