Rails: How to resume a rake task?

I think rake task is not the keyword here, but I don't know the correct keyword for this problem.
articles = Article.all

articles.each do |article|
  get_share(article) # use HTTParty, Nokogiri, etc.
  if article.save
    puts "#{article.url}, #{article.share}"
  end
end
I have this script to get the share count of a URL from Facebook, Twitter and other platforms. However, sometimes the loop is interrupted: maybe my internet connection breaks, maybe the Nokogiri parsing goes wrong, or there are simply too many articles.
So, if I run the task again, it starts over from the beginning, which is a real waste of time.
Is it possible to let it pick up where the loop stopped (at the specific article, in this case) and continue the script from there?
I could output article.id and then fetch the remaining articles with something like articles = Article.where("id > ?", stopped_id), but is this a good solution, or is there a more elegant approach?

In order to do this, you're going to have to store, somehow, which articles you've updated. You could look at the updated_at field of the articles table, but that would include articles that have been updated via the normal operation of your site.
A super simple method is just to read/write a temp file, e.g.:

tempfile = "/tmp/updated_article_ids.txt"

if File.exist?(tempfile)
  @updated_ids = File.readlines(tempfile).map { |l| l.chomp.to_i }
end

if @updated_ids.blank?
  articles = Article.all
else
  articles = Article.where(["id not in (?)", @updated_ids]).all
end

articles.each do |article|
  get_share(article) # use HTTParty, Nokogiri, etc.
  if article.save
    puts "#{article.url}, #{article.share}"
    File.open(tempfile, "a") { |f| f.puts article.id } # f.puts, not a bare puts
  end
end
If you know that you want to start from scratch, delete the tempfile. Or you could add a further test so the tempfile is only trusted if it's less than a day old or so, as sketched below.
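For instance, the age check might look like this (a sketch, reusing the tempfile from above; 1.day.ago assumes Active Support is loaded, which it is in a rake task that depends on :environment):

tempfile = "/tmp/updated_article_ids.txt"

# Only trust the progress file if it was written within the last day.
if File.exist?(tempfile) && File.mtime(tempfile) > 1.day.ago
  @updated_ids = File.readlines(tempfile).map { |l| l.chomp.to_i }
elsif File.exist?(tempfile)
  File.delete(tempfile) # stale progress: start over from scratch
end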

I think it's best to implement tasks like this using some sort of background-job tool. I personally like Delayed Job.
If you're not keen on doing something like that, you can always rescue the exception and build logic around that: either save the id as you mentioned, or add some sort of sleep-and-retry logic, as in the sketch below.
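A minimal sketch of the sleep-and-retry variant (the retry limit and backoff are arbitrary choices, not part of the original answer):

articles.each do |article|
  attempts = 0
  begin
    get_share(article) # use HTTParty, Nokogiri, etc.
    puts "#{article.url}, #{article.share}" if article.save
  rescue StandardError => e
    attempts += 1
    if attempts <= 3
      sleep(2 ** attempts) # back off a little longer after each failure
      retry
    else
      puts "Giving up on article #{article.id}: #{e.message}"
    end
  end
end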

Related

rails hash extract first value from list

I am trying to normalize some data in an ETL process because the data we get is not consistent.
Annoying, but I am here to learn.
Currently we do something like:

received = datum[:quantity_received] || datum[:received_cases] || datum[:received_quantity]

Curious if there is a more Ruby/Rails way of doing this? I'm considering:

received = datum.values_at(:quantity_received, :received_cases, :received_quantity).compact.first
I don't think there is a much better solution. I'd define some helper methods (I'm not a fan of long lines):
def value_you_need(datum)
  datum.values_at(*keys_of_interest).find(&:itself)
end

def keys_of_interest
  %i(quantity_received received_cases received_quantity)
end

received = value_you_need(datum)
The Object#itself method is available from Ruby 2.2.0; on older Rubies, go with compact.first instead.
Note one detail: if false is one of the values you care about, the compact.first version is the only correct one of the three, since compact removes only nil, while both || and find(&:itself) skip false. But I took it for granted that you don't, or the first option would have been wrong too.
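To make the false caveat concrete:

datum = { quantity_received: false, received_cases: 3 }

datum.values_at(:quantity_received, :received_cases).find(&:itself) # => 3 (false is skipped)
datum.values_at(:quantity_received, :received_cases).compact.first  # => false (only nil is removed)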

How do I ensure correctness when using find_in_batches?

My application currently has some stats needs, so I set up a background job using rufus-scheduler that runs at 3:00 and batch-processes records into a CacheStat table. It's just like any normal application's weekly/monthly stats needs.
I found that find_each (say, User.find_each to iterate over all users) invokes find_in_batches under the hood, so I checked the Rails source code:
while records.any?
  records_size = records.size
  primary_key_offset = records.last.id

  yield records

  break if records_size < batch_size

  if primary_key_offset
    records = relation.where(table[primary_key].gt(primary_key_offset)).to_a
  else
    raise "Primary key not included in the custom select clause"
  end
end
The implementation pages through the table by comparing the primary key. My concern is concurrency: while I am processing a batch, what if some records are inserted in between? Does anybody else have this kind of problem?
On reflection, this implementation may not actually be a problem, because new records will always have a larger primary key and will simply be picked up by a later batch.
So is this how such needs should be implemented? If I want to implement batch stats processing myself (without Rails), do I need to make sure there is an integer primary key and page by comparing that field (rather than some other kind of field)?
(I'm thinking about this because I'm in the middle of switching from MySQL to Mongo, so I may need to implement this kind of functionality myself later.)
If I understand correctly, you can ensure correctness here by enforcing transactional isolation, e.g.

User.transaction do
  User.find_each do |user|
    # process the user inside the transaction
  end
end
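If your Rails version (4.0+) and database adapter support it, you can also request the isolation level explicitly instead of relying on the database default; a sketch:

User.transaction(isolation: :repeatable_read) do
  User.find_each do |user|
    # rows committed by other transactions after this snapshot
    # was taken will not show up in any batch
  end
end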

Catching errors with Ruby Twitter gem, caching methods using delayed_job: What am I doing wrong?

What I'm doing
I'm using the twitter gem (a Ruby wrapper for the Twitter API) in my app, which runs on Heroku. I use Heroku's Scheduler to periodically run caching tasks that use the twitter gem to, for example, update the list of retweets for a particular user. I'm also using delayed_job, so the Scheduler calls a rake task, which calls a method that is 'delayed' (see scheduler.rake below). The method loops through "authentications" (for users who have authenticated Twitter through my app) to update each authorized user's retweet cache in the app.
My question
What am I doing wrong? For example, since I'm using Heroku's Scheduler, is delayed_job redundant? Also, you can see I'm not catching (rescuing) any errors. So, if Twitter is unreachable, or if a user's auth token has expired, everything chokes. This is obviously dumb and terrible because if there's an error, the entire thing chokes and ends up creating a failed delayed_job, which causes ripple effects for my app. I can see this is bad, but I'm not sure what the best solution is. How/where should I be catching errors?
I'll put all my code (from the scheduler down to the method being called) for one of my cache methods. I'm really just hoping for a bulleted list (and maybe some code or pseudo-code) berating me for poor coding practice and telling me where I can improve things.
I have seen this SO question, which helps me a little with the begin/rescue block, but I could use more guidance on catching errors, and on the higher-level "is this a good way to do this?" plane.
Code
Heroku Scheduler job:
rake update_retweet_cache
scheduler.rake (in my app)
task :update_retweet_cache => :environment do
  Tweet.delay.cache_retweets_for_all_auths
end
Tweet.rb, cache_retweets_for_all_auths method:
def self.cache_retweets_for_all_auths
  @authentications = Authentication.find_all_by_provider("twitter")
  @authentications.each do |authentication|
    authentication.user.twitter.retweeted_to_me(include_entities: true, count: 200).each do |tweet|
      # Actually build the cache - this is good - removing to keep this short
    end
  end
end
User.rb, twitter method:
def twitter
authentication = Authentication.find_by_user_id_and_provider(self.id, "twitter")
if authentication
#twitter ||= Twitter::Client.new(:oauth_token => authentication.oauth_token, :oauth_token_secret => authentication.oauth_secret)
end
end
Note: As I was posting this, I noticed that I'm finding all "twitter" authentications in the "cache_retweets_for_all_auths" method, then calling the "User.twitter" method, which specifically limits to "twitter" authentications. This is obviously redundant, and I'll fix it.
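A sketch of one way to remove that double lookup, building the client straight from the authentication record already in hand (it assumes the same oauth_token/oauth_secret columns used above):

def self.cache_retweets_for_all_auths
  Authentication.where(provider: "twitter").find_each do |authentication|
    client = Twitter::Client.new(
      :oauth_token        => authentication.oauth_token,
      :oauth_token_secret => authentication.oauth_secret
    )
    client.retweeted_to_me(include_entities: true, count: 200).each do |tweet|
      # Actually build the cache
    end
  end
end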
First, what is the exact error you are getting, and what do you want to happen when there is an error?
Edit:
If you just want to catch the errors and log them, then the following should work:

def self.cache_retweets_for_all_auths
  @authentications = Authentication.find_all_by_provider("twitter")
  @authentications.each do |authentication|
    begin
      authentication.user.twitter.retweeted_to_me(include_entities: true, count: 200).each do |tweet|
        # Actually build the cache - this is good - removing to keep this short
      end
    rescue => e
      # Either create an object where the error is logged, or output it to whatever log you wish.
    end
  end
end
This way, when it fails it will keep moving on to the next user while still making a note of the error. Most of the time with Twitter it's just better to do something like this than to try to handle each error on its own. I have seen so many weird things and random errors out of the Twitter API that trying to track down every error almost always turns into a wild goose chase, though it is still good to keep track just in case.
Next, when you should use what.
You should use a scheduler when you need something to happen based on time only, and delayed jobs when it's based on a user action but the "action" you are going to delay would take too long for a normal response. Sometimes you can just put the thing plainly in the controller, too.
So, in other words:
The scheduler will be fine as long as the time between updates, X, is greater than the time it takes for the update to happen, Y.
If X < Y, then you might want to look at calling the logic from the controller when each individual entry is accessed, instead of trying to do them all at once. The idea is that you would only update after a certain amount of time has passed. You could store the last update time either on the model itself, in a field like twitter_update_time, or in a Redis or memcached instance at a unique key for the user/auth.
But if the individual update itself still takes too long, that's when you should do the above, but instead of doing the actual update, call a delayed job.
You could even set it up so that it only updates or enqueues the delayed job after a certain number of views, to limit things further, as sketched below.
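A hypothetical controller-side version of that, combining a view counter with the twitter_update_time field mentioned above (the column names, job method, and thresholds are all made up for illustration):

def show
  @user = User.find(params[:id])
  @user.increment!(:view_count) # hypothetical counter column

  stale = @user.twitter_update_time.nil? || @user.twitter_update_time < 1.hour.ago
  if @user.view_count >= 10 && stale
    Tweet.delay.cache_retweets_for(@user) # hypothetical per-user job
    @user.update_attributes(:view_count => 0, :twitter_update_time => Time.now)
  end
end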
Possible Fancy Pants
Or if you want to get really fancy, you could still do it as a cron job, but have a point system based on views that weights which entries should be updated. Certain actions would add points to certain users, and if their points are over a certain threshold you update them and then reset their points. That way you could target the entries you think are the most important, have the most traffic, show up in the most search results, and so on.
Next, a nit-picky thing.
http://api.rubyonrails.org/classes/ActiveRecord/Batches.html
You should be using

@authentications.find_each do |authentication|

instead of

@authentications.each do |authentication|

find_each pulls in only 1000 entries at a time, so if you end up with a lot of Authentications you don't pull a crazy number of records into memory.
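One caveat (my assumption about the Rails 3-era finders used above): find_all_by_provider returns a plain Array, which doesn't respond to find_each, so the collection needs to come from a relation, e.g.:

Authentication.where(provider: "twitter").find_each(batch_size: 1000) do |authentication|
  # each batch of 1000 records is loaded and released before the next
end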

Filter cached DB queries from Rails' logs?

Is it possible to filter out DB queries that hit Rails' query cache from the Rails logs? The presence of these "queries" makes it harder to debug performance issues.
I was going to ask this question a few weeks ago, but then didn't get round to it. I'm not sure what the best way to do this is, but I'd think something like:

class QuietLogger < Logger
  def add(severity, message = nil, progname = nil, &block)
    super unless message =~ /CACHE/
  end
end

ActiveRecord::Base.logger = QuietLogger.new(STDOUT) # or a log file path
I haven't tested this, but what you'd need to do is override the add method on the Logger and check for messages that contain CACHE. Hopefully I'm not that far off!
I recommend request-log-analyzer for analyzing rails performance from logs.

Profile a rails controller action

What is the best way to profile a controller action in Ruby on Rails? Currently I am using the brute-force method of throwing in puts Time.now calls between what I think will be a bottleneck, but that feels really, really dirty. There has got to be a better way.
I picked up this technique a while back and have found it quite handy.
When it's in place, you can add ?profile=true to any URL that hits a controller. Your action will run as usual, but instead of delivering the rendered page to the browser, it'll send a detailed, nicely formatted ruby-prof page that shows where your action spent its time.
First, add ruby-prof to your Gemfile, probably in the development group:
group :development do
  gem "ruby-prof"
end
Then add an around filter to your ApplicationController:
around_action :performance_profile if Rails.env == 'development'

def performance_profile
  if params[:profile] && result = RubyProf.profile { yield }
    out = StringIO.new
    RubyProf::GraphHtmlPrinter.new(result).print out, :min_percent => 0
    self.response_body = out.string
  else
    yield
  end
end
Reading the ruby-prof output is a bit of an art, but I'll leave that as an exercise.
Additional note by ScottJShea:
If you want to change the measurement type place this:
RubyProf.measure_mode = RubyProf::GC_TIME #example
Put it before the if in the performance_profile method of the ApplicationController. You can find a list of the available measurement modes on the ruby-prof page. As of this writing, the memory and allocations data streams seem to be corrupted (see defect).
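In context, that placement might look like this (a sketch; GC_TIME is just the example mode from the note above):

def performance_profile
  RubyProf.measure_mode = RubyProf::GC_TIME # example measurement mode
  if params[:profile] && result = RubyProf.profile { yield }
    out = StringIO.new
    RubyProf::GraphHtmlPrinter.new(result).print out, :min_percent => 0
    self.response_body = out.string
  else
    yield
  end
end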
Use the Benchmark standard library and the various tests available in Rails (unit, functional, integration). Here's an example:
def test_do_something
  elapsed_time = Benchmark.realtime do
    100.downto(1) do |index|
      # do something here
    end
  end
  assert elapsed_time < SOME_LIMIT
end
So here we just do something 100 times, time it via the Benchmark library, and ensure that it took less than SOME_LIMIT amount of time.
You may also find these links useful: the Benchmark.realtime reference and the Test::Unit reference. Also, if you're into the 'book reading' thing, I picked up the idea for this example from Agile Web Development with Rails, which talks all about the different testing types and a little about performance testing.
There's a Railscast on profiling that's well worth watching:
http://railscasts.com/episodes/98-request-profiling
You might want to give the FiveRuns TuneUp service a try, as it's really rather impressive. Disclaimer: I'm not associated with FiveRuns in any way, I've just tried this service out.
TuneUp is a free service whereby you download a plugin and when you run your application it injects a panel at the top of the screen that can be expanded to display detailed performance metrics.
It gives you some nice graphs, including one that shows what proportion of time is spent in the Model, View and Controller. You can even drill right down to see the individual SQL queries that ActiveRecord is executing if you need to and it can show you the underlying database schema with another click.
Finally, you can optionally upload your profiling data to the FiveRuns site for community performance analysis and advice.
This works in Rails 4.2.6:

require "ostruct" # OpenStruct comes from the stdlib

o = OpenStruct.new(logger: Rails.logger)
o.extend ActiveSupport::Benchmarkable

o.benchmark 'name' do
  # ... your code ...
end
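For reference, ActiveSupport::Benchmarkable logs the elapsed time at the info level by default, producing a line like name (347.2ms) in the Rails log; you can pass a :level option (e.g. o.benchmark 'name', level: :debug) to log it at a different level.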
