Ruby on Rails filling database with scraped data daily - ruby-on-rails

I'm trying to set up a process that scrapes data from a set of websites on a fixed schedule (daily, monthly, etc.) and then fills the database tables. What would be the best way to do this? Would it be best to write a Ruby script outside of Rails and use a cron task to run it on my own schedule? Or is there a way to do this within the Rails framework?

Step 1: Create a rake task
e.g. lib/tasks/scrapping.rake
namespace :scrapping do
  desc "Fetches new data from websites"
  task scrap_websites: :environment do
    # Call your scraping classes/jobs/whatever code here
  end
end
Step 2: Create a CRON task calling your rake task
You can use a gem like whenever for this: https://github.com/javan/whenever
For instance, your config/schedule.rb could look like this:
every 1.day, at: '4:00am' do
  rake 'scrapping:scrap_websites'
end
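The "scraping classes" that the rake task calls can be plain Ruby objects, which keeps the task itself to a single line. Below is a minimal sketch of that idea, not a definitive implementation. PageScraper, the fetcher lambda, and the hash used as a store are hypothetical names introduced for illustration. In a real app the fetcher would make an HTTP request and the store would be one of your models.

```ruby
# Hypothetical scraper class; none of these names come from Rails or a gem.
class PageScraper
  # Captures the text between <title> and </title>, case-insensitively.
  TITLE_PATTERN = %r{<title>(.*?)</title>}mi

  # fetcher: anything that responds to call(url) and returns HTML as a String.
  # store:   anything that responds to []=, standing in for a database table.
  def initialize(fetcher:, store:)
    @fetcher = fetcher
    @store = store
  end

  # Fetches each URL and records its page title, skipping pages without one.
  def run(urls)
    urls.each do |url|
      html = @fetcher.call(url)
      title = html[TITLE_PATTERN, 1]
      next unless title

      @store[url] = title.strip
    end
    @store
  end
end

# Canned pages so the sketch runs without network access (assumption).
SAMPLE_PAGES = {
  'https://example.com/a' => '<html><head><title> First page </title></head></html>',
  'https://example.com/b' => '<html><body>no title here</body></html>'
}.freeze

fetcher = ->(url) { SAMPLE_PAGES.fetch(url) }
results = PageScraper.new(fetcher: fetcher, store: {}).run(SAMPLE_PAGES.keys)
puts results.inspect
```

Because the fetcher is passed in, the same class can be exercised from a test with canned HTML and from the rake task with a real HTTP client.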

Related

Where to put code that downloads and stores data?

I'm pretty much new to Rails and have a question about how to organize my code. I read about fat models and skinny controllers and it makes a lot of sense (in theory?).
What I'd like to do now is this: Periodically (via Cron and a Rails runner) download some data from the Internet and store parts of it in the database. What I don't understand is where to put the code which speaks with the API from which I get the data. Do I put it into the model and let it look like this:
API data
'--> Model --> Database
What about another case where the downloaded data has to be split up and stored in two different models / database tables? Which version do I choose?
Version 1:
API data
'--> Model --> Database
'--> Model --> Database
Version 2:
API data
'--> Controller
|--> Model --> Database
'--> Model --> Database
Thanks for your help! :)
As @agmcleod suggested, you should go with a rake task, which you run with rake task_name, and then add it to your cron jobs.
Start with:
rails g task api_service fetch_data_for_model1 fetch_data_for_model2
Now open lib/tasks/api_service.rake:
namespace :api_service do
  desc "Update database with new Movies"
  task :fetch_data_for_model1 => :environment do
    puts 'start fetching data'
    API.new(credentials).fetch_movies.each do |movie|
      puts "creating movie id=#{movie.id}"
      Movie.find_or_create_by(movie.attributes)
    end
    puts 'Finished!'
  end
  desc "TODO"
  task :fetch_data_for_model2 => :environment do
    ....
  end
end
Now open your crontab:
crontab -e
and add an entry that changes into the app directory and runs the rake task:
00 00 * * * cd /Users/you/projects/myrailsapp && /usr/local/bin/rake RAILS_ENV=production api_service:fetch_data_for_model1
You may consider using https://github.com/javan/whenever, which provides a clear syntax for writing and deploying cron jobs.
With whenever, you would create a schedule.rb file with definitions like this:
every 1.day, :at => '4:30 am' do
  rake 'api_service:fetch_data_for_model1'
end
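One thing to watch in a task that runs every day is what find_or_create_by matches on. The sketch below assumes that each API record carries a stable identifier. MovieTable is an in-memory stand-in for a model, and api_id, import_by_all_attributes, and import_by_key are hypothetical names. It shows how matching on every attribute can produce a second row when a single field changes between runs, while matching on the key alone updates the existing row.

```ruby
# In-memory stand-in for an ActiveRecord model (hypothetical, for illustration).
class MovieTable
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Mimics the idea of find_or_create_by: return the first row that matches
  # every given attribute, or append a new row built from those attributes.
  def find_or_create_by(attrs)
    @rows.find { |row| attrs.all? { |key, value| row[key] == value } } ||
      attrs.dup.tap { |row| @rows << row }
  end
end

# Matches on the full attribute set, as in the task above.
def import_by_all_attributes(table, movies)
  movies.each { |movie| table.find_or_create_by(movie) }
end

# Matches on the stable key only, then refreshes the remaining attributes.
def import_by_key(table, movies)
  movies.each do |movie|
    row = table.find_or_create_by(api_id: movie[:api_id])
    row.merge!(movie)
  end
end

# The same movie as returned on two consecutive days; only the rating moved.
day_one = [{ api_id: 1, title: 'Alien', rating: 8.4 }]
day_two = [{ api_id: 1, title: 'Alien', rating: 8.5 }]

all_attrs = MovieTable.new
[day_one, day_two].each { |batch| import_by_all_attributes(all_attrs, batch) }

by_key = MovieTable.new
[day_one, day_two].each { |batch| import_by_key(by_key, batch) }

puts "matching on all attributes: #{all_attrs.rows.size} rows"
puts "matching on api_id only:    #{by_key.rows.size} rows"
```

In a real model the second approach would look up the record by its key and then update the other columns, so repeated cron runs stay idempotent.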

Rails - How to auto-transfer records from table to table at a specific time?

I am pre-storing records in table A and I want to transfer these records from table A to table B automatically at a specific time, let's say every evening at 08:00 PM.
Any ideas on how to solve this little problem?
You could create a rake task to do the job and then schedule it with cron, the default *nix scheduler. Its syntax is difficult to remember, so I prefer to use a Ruby wrapper around it, the whenever gem.
You can use the whenever gem to run cron jobs, for example a job that runs every 5 minutes.
In schedule.rb:
every 5.minutes do
  rake "transfer_data:send_data"
end
lib/tasks/send_data.rake
#!/usr/bin/env ruby
namespace :transfer_data do
  desc "Rake task to transfer data"
  task :send_data => :environment do
    ## code to transfer data from one table to other table
  end
end
Execute the task using bundle exec rake transfer_data:send_data

Invoke ActionMailer from cron job in Rails 3?

Is it possible to invoke ActionMailer from a cron job? We're using Rails 3.
Ideally, a cron job triggers a script to read users from a database, passes these users and some variables to an ActionMailer class to fire off emails.
Is this possible in Rails 3.2.12?
Yes it is possible. You could use a task to invoke with the rake command. Your task could be something like this:
# lib/tasks/cron.rake
namespace :cron do
  desc "Send account emails"
  task deliver_emails: :environment do
    accounts_for_delivery = Account.where(condition: true)
    # ... whatever logic you need
    accounts_for_delivery.each do |account|
      Postman.personalized_email_for(account).deliver
    end
  end
end
And your mailer and the corresponding view could look like this:
# app/mailers/postman.rb
class Postman < ActionMailer::Base
  def personalized_email_for(account)
    @account = account
    mail to: account.email
  end
end
# app/views/postman/personalized_email_for.text.haml
= @account.inspect
Now you can set up the crontab to run your rake task the same way you run any other rake task. I recommend the whenever gem, which provides a nice way to define cron jobs for your application. It looks like this:
# config/schedule.rb
every 6.hours do
  rake 'cron:deliver_emails'
end
So now the cron job definitions are bound to your application. It works well with Capistrano between deployments as well. You can also pass variables to your task or execute system commands.
If everything else fails you can just create a normal controller action and let the cronjob call it with curl.
Otherwise, any script in your Rails app's script folder can be started with rails runner script/myscript.rb from the command line and has full access to all Rails features.
You can use rails r (rails runner) to run a script in your rails app. It runs it, loading in the full context of your rails app before doing so, so all your models etc. are available. I use it a lot. For example,
rails r utilities/some_data_massaging_script.rb
From cron, you'd obviously need to give it the full path to your app.
The old-fashioned way was to have something like:
require "#{File.dirname(__FILE__)}/../config/environment.rb"
at the top of your script (adjusting the relative bit of the path depending on the subdirectory level of your script in your app of course) and then just run your script using ruby, but rails r makes that unnecessary.
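The "adjusting the relative bit" part of the old-fashioned approach can be made explicit with File.expand_path, which resolves the path independently of the directory cron happens to run the script from. This is only a sketch. environment_path_for and the /srv/myapp layout are hypothetical, and the paths assume a Unix-style filesystem.

```ruby
# Hypothetical helper: given the path of a script inside the app, return the
# absolute path of config/environment.rb, climbing `levels_up` directories
# from the directory that contains the script.
def environment_path_for(script_path, levels_up: 1)
  relative = File.join(*(['..'] * levels_up), 'config', 'environment.rb')
  File.expand_path(relative, File.dirname(script_path))
end

# A script directly under script/ needs to climb one level ...
puts environment_path_for('/srv/myapp/script/report.rb')
# ... while one nested a directory deeper needs two.
puts environment_path_for('/srv/myapp/script/reports/daily.rb', levels_up: 2)
```

Inside a real script you would pass __FILE__ as the script path and require the result.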

Scheduling in ruby on rails involving database access

I want to schedule daily reports to subscribed users via email.
For that I have written an action in reports_controller that fetches data from the database and converts it to PDF using pdfkit/wkhtmltopdf. The action works fine when called via a GET request. But when I convert it into a class method like this:
def self.dailymail
  ac = ActionController::Base.new()
  kit = PDFKit.new #retrieve data from db
  pdf = kit.to_pdf
  ReportMailer.send_reports(ac.send_data(pdf)).deliver
end
It raises an exception at the send_data call when used with rufus scheduler:
RackDelegation#content_type= delegated to @_response.content_type=, but @_response is nil: #<ActionController::Base:0x206b068 @_routes=nil, @_action_has_layout=true, @_headers={"Content-Type"=>"text/html"}, ...
So my question is: how can I solve this problem, or is there an alternative scheduler for Rails that works well on both Windows and Linux?
I would like to know of any scheduler that can help send reports fetched from the database.
I agree with claasz regarding the rake task. Check out the whenever gem https://github.com/javan/whenever
There is no support for Windows Task Scheduler, but it does support creating cron jobs.
Check out the documentation for the details, but essentially the gem creates cron jobs based on what you configure in the schedule.rb file that is created when you install the gem.
Sample content of schedule.rb:
every 3.hours do
  runner "MyModel.some_process"
  rake "my:rake:task"
  command "/usr/bin/my_great_command"
end
This would be like running bundle exec rake my:rake:task every 3 hours
After creating the schedule.rb you will need to run the whenever command from the console in order to add your schedule to cron. If you run whenever without arguments, the output shows you the contents of the schedule.rb. There is an argument you need to provide that I can't remember off the top of my head, just pass --help and I think you'll get the answer.
Hope this helps
EDIT: The argument is -w, which writes the schedule to the crontab.
As willglynn already points out, you should get rid of any controller interaction. There's simply no need here and it makes things unnecessarily complicated. So your code should look more like
def self.dailymail
  kit = PDFKit.new #retrieve data from db
  pdf = kit.to_pdf
  ReportMailer.send_reports(pdf).deliver
end
If you have problems with the rufus scheduler (which I'm not familiar with), you could create a rake task to send out your mails and use the OS scheduler (e.g. cron on Linux) to call the task. Having the rake task would also be convenient for testing.

How do I run Ruby tasks that use my Rails models?

I have a Rails app with some basic models. The website displays data retrieved from other sources. So I need to write a Ruby script that creates new instances in my database. I know I can do that with the test hooks, but I'm not sure that makes sense here.
I'm not sure what this task should look like, how I can invoke it, or where it should go in my source tree (lib\tasks?).
For example, here's my first try:
require 'active_record'
require '../app/models/mymodel.rb'
test = MyModel.new
test.name = 'test'
test.save
This fails because it can't get a connection to the database. This makes sense in a vague way to my newbie brain, since presumably Rails is doing all the magic work behind the scenes to set all that stuff up. So how do I set up my little script?
You can load the entire rails environment in any ruby script by simply requiring environment.rb:
require "#{ENV['RAILS_ROOT']}/config/environment"
This assumes the RAILS_ROOT environment variable is set, see my comment for other ways of doing this.
This has the added bonus of giving you all the nice classes and objects that you have in the rest of your rails code.
To kick off your processes it sounds like cron will do what you want, and I would also add a task to your capistrano recipe that would add your script to the crontab to periodically get the data from the external source and update your DB. This can easily be done with the cronedit gem.
The cron approach does have some drawbacks, mostly overhead and control, for other more sophisticated options see HowToRunBackgroundJobsInRails from the rails wiki.
I agree with the answer above but you have to include => :environment in your task or it will not load the Rails environment.
e.g.,
namespace :send do
  namespace :trial do
    namespace :expiry do
      desc "Sends out emails to people whose accounts are about to expire"
      task :warnings => :environment do
        User.trial_about_to_expire.has_not_been_notified_of_trial_expiry.each do |user|
          UserMailer.deliver_trial_expiring_warning(user)
          user.notified_of_trial_expiry = true
          user.save
        end
      end
    end
  end
end
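The effect of => :environment can be seen in isolation with plain Rake, outside of Rails. In the sketch below, the top-level :environment task is a stand-in for the one Rails defines (the real one boots the application), and the demo namespace, RUN_LOG, and the task names are made up for illustration. It shows that Rake runs the prerequisite before the task body.

```ruby
require 'rake'

# Make `task`, `namespace` and `desc` available outside a Rakefile.
extend Rake::DSL

# Records the order in which task bodies execute.
RUN_LOG = []

# Stand-in for the :environment task that Rails provides (assumption:
# here it only logs, whereas the real one loads the Rails app).
task :environment do
  RUN_LOG << :environment_loaded
end

namespace :demo do
  desc 'Pretends to use the models'
  task :warnings => :environment do
    RUN_LOG << :task_ran
  end
end

# Invoking the task first runs its prerequisite, then its own body.
Rake::Task['demo:warnings'].invoke
puts RUN_LOG.inspect
```

Without the prerequisite, only the task body would run, which in a Rails app means the models would not have been loaded yet.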
I'd suggest creating custom rake tasks (lib/tasks/foo.rake). This gives you easy access to most of the functionality of your Rails app.
namespace :foo do
  desc 'do something cool'
  task :something_cool => :environment do
    test = MyModel.new
    test.name = 'test'
    test.save
  end
end
Then:
$ rake -T foo
rake foo:something_cool # do something cool
You can even run the tasks via a cronjob.
I wrote up a post about this a while back.
http://www.rawblock.com/2007/06/14/ruby-oracle-mac-os-x-pain-jruby-and-activerecord-jdbc-to-the-rescue/
You can open a connection in your scripts as such:
ActiveRecord::Base.establish_connection(
  :adapter => "mysql",
  :username => "root",
  :host => "localhost",
  :password => "******",
  :database => "******"
)
I'm sure there is a more elegant way to do it, so that it grabs the info from your database.yml.
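One way to take the settings from database.yml is to run the file through ERB, parse it as YAML, and pick the section for the current environment. The sketch below works on an inline string so that it runs on its own. The YAML contents, the MYAPP_DB_PASSWORD variable, and the database_config helper are assumptions made for illustration. In an app you would read config/database.yml instead.

```ruby
require 'erb'
require 'yaml'

# Hypothetical database.yml contents, inlined so the example is self-contained.
DATABASE_YML = <<~CONFIG
  development:
    adapter: mysql
    host: localhost
    username: root
    password: "<%= ENV.fetch('MYAPP_DB_PASSWORD', 'secret') %>"
    database: myapp_development
  production:
    adapter: mysql
    host: db.internal
    username: myapp
    password: "<%= ENV.fetch('MYAPP_DB_PASSWORD', 'secret') %>"
    database: myapp_production
CONFIG

# Renders any ERB tags, parses the YAML, and returns the hash for one environment.
def database_config(yaml_text, env)
  rendered = ERB.new(yaml_text).result
  YAML.safe_load(rendered, aliases: true).fetch(env)
end

config = database_config(DATABASE_YML, 'development')
puts config['adapter']
puts config['database']

# In a script with ActiveRecord loaded, this hash could then be handed to
# ActiveRecord::Base.establish_connection(config).
```

This keeps credentials in one place instead of repeating them in every script.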
There are a few steps to this, and more details are needed to answer it well.
You say your site retrieves data from other sources? How often? If it is semi-regularly you definitely want to look into background processing/messaging. If it is frequently you really want to avoid loading your rails environment every time your script runs since you will be paying too high a startup tax each time.
There are a multitude of options out there you will want to research. Reading about each of them, particularly reviews from people who post about why they made the choice they did, will give you a good feel for what questions you need to ask yourself before you make your choice. How big a job is loading the data? etc...
Off the top of my head these are some of the things you may want to look into
Script/Runner & Cron
Background/RB
Starling
Workling
MemcacheQ
Beanstalk
Background Job (Bj)
delayed_job (Dj)
Daemon Generator
Check out my answer in "A cron job for rails: best practices?".
It contains two examples for using cron to run Rake tasks and class methods (via script/runner). In both cases, Rails is loaded and you can use your models.
Nice Joyent write up of using rake to run rails tasks from a cron job - http://wiki.joyent.com/accelerators:kb:rails:cron
The easiest way to run Ruby tasks that interact with your Rails app and models is to have Rails generate the rake tasks for you! :)
Here's an example.
Run rails g task my_namespace my_task
This will generate a file called lib/tasks/my_namespace.rake, which looks like this once you fill in the task body:
namespace :my_namespace do
  desc "TODO: Describe your task here"
  task :my_task => :environment do
    # write any Ruby code here and also work with your models
    puts User.find(1).name
  end
end
Run this task with rake my_namespace:my_task
Watch your Ruby task interact with your Rails models as it runs!
