Insert into Rails Database - ruby-on-rails

I'm new to Ruby on Rails and wanted to create a crawler that scrapes data and inserts it into the database. I'm currently using Heroku so I can't access the database directly and was wondering what the best way to integrate a crawler script into the RoR framework would be. I would be using an hourly or daily cron to run the script.

If you are using Rails on Heroku you can just use an ORM adapter like DataMapper or ActiveRecord. This gives you access to your database, but through an abstraction layer. You can send raw SQL to the database if you need to, but it's usually not recommended, since the ORMs provide pretty much everything you need.
You would basically just create models within your Rails application as normal, along with the associated fields in a table.
rails g model page meta_title:string page_title:string
rake db:migrate # This has to be run on Heroku too ("heroku run rake db:migrate") after you have pushed your code up
Then in your crawler script you can create records by just using your model...
Page.create(:page_title => crawler[:title], :meta_title => crawler[:meta_title])
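For a standalone crawler script run from cron, a minimal sketch might look like the following. The script path, the Crawler class, and its fetch_pages method are hypothetical; the key point is requiring config/environment.rb so ActiveRecord and your models are loaded:

#!/usr/bin/env ruby
# script/crawl.rb -- boots the full Rails environment so models are available
require File.expand_path('../../config/environment', __FILE__)

Crawler.fetch_pages.each do |crawler|  # hypothetical: returns an array of attribute hashes
  Page.create(:page_title => crawler[:title], :meta_title => crawler[:meta_title])
end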
Normally you could use Whenever (https://github.com/javan/whenever) to manage your cron jobs, but I'm not sure how it works on Heroku since I haven't set any up there before.
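That said, Whenever's config/schedule.rb is plain Ruby, so a schedule for a script like the one above is only a few lines (the path is a placeholder):

every 1.hour do
  command "cd /path/to/app && ruby script/crawl.rb"
end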

I'd suggest one of two options:
Use a standalone Ruby script that requires rubygems along with whatever other helper libraries (Rails, ActiveRecord, etc.) you need to accomplish the task, and then run that script from cron.
If you're also using Rails to serve web apps, use the machine's hosts file so that a wget (or similar) on that machine properly maps requests to that instance of Rails; from there, just set the task up as a web endpoint and put the wget command in your cron entry, as shown below. Not terribly efficient, but if you're just looking for something quick and dirty based on an existing setup, it works nicely. Just make sure to redirect STDOUT and STDERR to /dev/null so you don't end up amassing cron output.
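For that second option, the cron entry could be as simple as this; the URL is a placeholder for whatever route you expose:

0 * * * * wget -q -O - http://myapp.local/crawler/run > /dev/null 2>&1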

Related

Rails migration strategies on horizontally scaled apps

Assume I have an OpsWorks Rails application running with load and time based scaling.
What happens if I deploy code while multiple application servers are running, so that rake db:migrate is executed on each of the application servers?
Does Rails have any guard against this, or would I need to designate a single server responsible for running the migrations?
I am also curious to hear about migration strategies for Rails + RDS (Postgresql) on AWS.
The answer is yes, Rails has a guard against this situation.
First, the database itself takes locks on DDL, so for example two of your instances running the same CREATE INDEX query at the same time cannot corrupt the schema; the database handles it.
Second, Rails automatically creates a table named schema_migrations. When one of your instances runs db:migrate, schema_migrations lets the other instances know that the database has already been migrated up (much like a version management system).
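You can check what the database thinks has been applied with the standard task:

rake db:version # prints the most recently applied migration version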
However, it is bad practice to put custom data-manipulation queries in your db/migrate/*.rb files; those queries are not protected in the same way.
If you have to do something like a custom data update, you should write a rake task and execute it manually, as sketched below.
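A sketch of such a one-off data task; the names are illustrative, not from the question:

# lib/tasks/data.rake
namespace :data do
  desc "Backfill a custom field outside of db:migrate"
  task :backfill_domain => :environment do
    User.where(:domain => nil).find_each do |user|
      user.update_attribute(:domain, user.email.split('@').last)
    end
  end
end

You would then run it once, by hand, on a single server.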
So, as long as you only use db:migrate to change the data structure, everything is already handled for you, even with hundreds of instances.
For more information, please refer to the Rails migrations guide.

Calling rake task on web request

In my Rails app, the user can upload a file, and when the file is uploaded I want to start a rake task which parses the file and feeds all the tables in the database; depending on how large the file is, this can take some time. I assume something like system "rake task_name" in my controller will do the work.
However, is that best practice? Is it safe? Because that way any user would be starting a rake process. In this RailsCast they recommend running rake in the background. What would be an alternative, or the common practice?
I am using Rails 3.2, so I couldn't use Active Job; I used Sidekiq and it worked fine.
Check a very helpful tutorial here
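For anyone after the shape of that setup, here is a minimal sketch; the worker name, the Record model, and the CSV format are placeholders for the question's actual parsing logic:

# app/workers/file_import_worker.rb
require 'csv'

class FileImportWorker
  include Sidekiq::Worker

  def perform(file_path)
    # parse the uploaded file and feed the tables
    CSV.foreach(file_path, :headers => true) do |row|
      Record.create!(row.to_hash)  # Record stands in for your model
    end
  end
end

The controller then enqueues the job instead of shelling out to rake:

FileImportWorker.perform_async(uploaded_file.path)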

Automatically restarting a Rails app and running Rake tasks

I have a Rails application that pulls information from various internal APIs and stores it in a local SQLite DB. My Rails application is essentially a glorified UI on top of this data pulled from multiple APIs.
For reasons outside the scope of this question, it is not straightforward to simply update the information in the DB on a periodic basis by re-querying the API. I'm basically forced to recreate the DB from scratch.
In essence I want to do something like this every X hours -
Automatically shut down the rails application
Put up a maintenance page ("Sorry, we'll be back in a few mins")
Drop the db, recreate it, and re-migrate it (rake db:drop; rake db:create; rake db:migrate)
Run my custom rake task that populates the tables again (rake myApp:update)
Re-start the application
This brings up a few questions
How do I have the app restart automatically every X hours? Should I schedule this externally using a cron job? Is there a Rails way to accomplish this?
How do I display a maintenance page while the app is down? Again, is this an external redirect I need to manage?
Most importantly, is there a good way to drop the tables and recreate them, or should I be using rake tasks? Is there a way to call rake tasks at startup? I guess I could create a .rb file under config/initializers that would run at startup (but only when Rails.env == 'production')?
Thanks for the help!
Just create a cron task that runs periodically. That cron task starts a shell script that simply does all the steps you would run manually.
There is a gem (https://github.com/biola/turnout) that helps with the maintenance page. It provides rake tasks like rake maintenance:start and rake maintenance:end.
I think it is not necessary to drop the tables. Usually it should be enough to delete all records and then create new ones. If you really have to drop the database, it might be faster to restore the database schema from a structure dump.
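Putting those pieces together, the shell script run by cron might look like this; the paths are placeholders and myApp:update comes from the question:

#!/bin/sh
cd /path/to/app
RAILS_ENV=production bundle exec rake maintenance:start
RAILS_ENV=production bundle exec rake db:drop db:create db:migrate
RAILS_ENV=production bundle exec rake myApp:update
RAILS_ENV=production bundle exec rake maintenance:end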
There is not a 'Rails way' to reload a Rails app every so often, since starting and stopping a Rails app is outside the context of the app itself. A cron job is a fine way to go.
Typically a web server is running "in front" of the Rails app; Apache or nginx are common. The maintenance page would need to be implemented at that level (since the Rails app is down for a moment, remember); something like a temporary change to the configs and a reload should suffice. Then, when bringing the app back online, restore the config to point to the Rails app and reload again.
Using the rake tasks you have is fine; set the environment variable RAILS_ENV=production so they hit the right SQLite DB file. Don't bother attempting to call the rake tasks at Rails startup; call them from the script run by your cron job, and then start the app after that.

Best Practices - RoR 4 Heroku - Cron to fill database each hour from external API

I have to call external APIs each hour to fill my database, which is hosted on Heroku.
For this purpose, I have a Ruby script that gets all the data from the external APIs and outputs it on stdout. Now I would like to store those results in my database, and I see a few different ways to do it (please leave a comment if you know a better way).
What I have (constraints):
Ruby on Rails Application running on Heroku
PG Database hosted on Heroku
"Cars" model, with "Title", "Description", "Price" attributes, and 1 other nested attribute from "Users" Model (So same schema in PG).
Ruby Script that query the differents externals API
Ruby Script have to be called each hours / 2 hours / days. The script is going to run for about 10 minutes -> 2 hours depending of the number of results
My three different ways to do it:
Run the script on an EC2 instance and fill my database by logging in to the database directly, not via the Ruby on Rails REST API.
The problem is that this never runs the Ruby on Rails validations, so for example if my database changes, or if I have to validate some data, it won't happen.
Run the script on an EC2 instance and fill my database with calls to my RoR REST API, sending the data as JSON / XML. The problem is that with > 1000 calls to the API, it could put my dynos under heavy load.
Run my script on a specific dyno on Heroku (I need more information here; I can't find much in the Heroku docs).
What do you think? I need something really flexible: if tomorrow I change my "Cars" model, it has to be easy to switch between the old and new model.
Thank you.
I would think that the best approach would be to use a background process to perform the work. Gems like Sidekiq (http://sidekiq.org/) and DelayedJob both have the ability to schedule jobs (which, in your case, would then reschedule themselves for 2 hours later).
On Heroku, workers run separately from your web dynos, so they won't interfere with web performance. It also keeps things simple in that you don't need to expose an API, since you'll have direct access to your models from the worker.
There are plenty of Heroku docs on this subject:
https://devcenter.heroku.com/articles/background-jobs-queueing
https://devcenter.heroku.com/articles/delayed-job
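As a sketch, a self-rescheduling Sidekiq worker along those lines could look like this; ExternalApi.fetch_cars is a stand-in for the question's existing script:

class ImportCarsWorker
  include Sidekiq::Worker

  def perform
    # create records through the model so ActiveRecord validations run
    ExternalApi.fetch_cars.each { |attrs| Car.create(attrs) }
  ensure
    self.class.perform_in(2.hours)  # schedule the next run
  end
end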
You can do this by writing your scripts as rake tasks and then using Heroku Scheduler to run your task(s) at specific intervals:
https://devcenter.heroku.com/articles/scheduler
You can separate your tasks by schedule if you have several, and then just add multiple scheduler entries. They run in one-off dynos (which you pay for at the normal rate), and since they run from the same code base they can leverage all your existing app code (models, libs, etc.).
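A sketch of the corresponding rake task that Heroku Scheduler would invoke as rake cars:import; again, ExternalApi.fetch_cars is a stand-in for the existing script:

# lib/tasks/import.rake
namespace :cars do
  desc "Pull cars from the external APIs"
  task :import => :environment do
    ExternalApi.fetch_cars.each do |attrs|
      Car.create(attrs)  # goes through the model, so validations run
    end
  end
end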

What's the Rails way to mirror an LDAP directory as a SQL table?

So I'm creating a Rails app that will function as a directory for contact information in our organization. But it's not as simple as it sounds.
We have a large LDAP-enabled directory which contains information for all of the users in our organization of tens of thousands.
We have a smaller, separate LDAP-enabled directory which contains additional information for our department of several hundred, as well as some information duplicating or taking precedence over fields in the larger directory.
We will want to hand edit some of this data to override some of the fields in our local directory, which will be represented by a SQL table in the Rails app.
The remote directories will be periodically mirrored as SQL tables, and the 3 tables (Organization, Department, Local) will be compared to choose the correct value displayed in the app.
I know that sounds ridiculous, but nothing can be done for it. Our organization is very decentralized, and this is the most straightforward way to get what we want.
This is my second Rails app, so I'm comfortable with most of the design and plumbing, but I don't know the best way to periodically poll the data from the remote directories and import it into our local SQL tables.
Where in my app should I periodically import data into my tables from LDAP? Do I use Rails? Should I do this in straight Ruby and run it as a cron job?
If you want to have the sync functionality as part of your Rails app, you can put that logic in a separate model class (let's call it LDAPSynchroniser).
Then you can reuse it from multiple places, including:
a rake task for manual synchronisation;
a cron job running that rake task;
a trigger from the web application (take into account the time it takes to run!).
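A rough sketch of LDAPSynchroniser itself, using the net-ldap gem; the host, base, model, and attribute mappings are all assumptions:

require 'net/ldap'

class LDAPSynchroniser
  def run
    ldap = Net::LDAP.new(:host => "ldap.example.org", :port => 389)
    ldap.search(:base => "ou=people,dc=example,dc=org",
                :filter => Net::LDAP::Filter.eq("objectClass", "person")) do |entry|
      contact = Contact.where(:uid => entry.uid.first).first_or_initialize
      contact.update_attributes(:name => entry.cn.first, :email => entry.mail.first)
    end
    :ok
  end
end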
The rake task would look like:
task :ldapsync => :environment do
  puts "Sync-ing with LDAP..."
  status = LDAPSynchroniser.new.run
  puts "done: #{status}"
end
The web application trigger would be a regular controller:
class LDAPSyncController < ApplicationController
  # probably authentication is needed...
  def sync
    status = LDAPSynchroniser.new.run # or run it in a separate thread
    # respond with status
  end
end
Now to answer your questions:
Where in my app should I periodically import data into my tables from LDAP?
Use rake task + cron.
Do I use Rails?
You probably need to boot Rails, but you don't need to run the Rails web server for that.
You might, though, also want to trigger the task from the web application itself.
Should I do this in straight Ruby and run it as a cron job?
Doing it in Rails would be a little easier, as you already have your models and everything you need.
Plain Ruby might be possible as well, but I don't think it is worth the effort.
