I have a question about using Nokogiri in my Rails app. I'm trying to collect baseball stats from a website and display the data in the view. I can parse the data successfully; however, I'm not sure where to put the code in a RESTful manner.
Currently, I'm collecting the stats, putting them in an array and then matching them with another array (by rank, team, league, etc.). The two arrays are then put into a hash. Is there a more efficient way to do this (as in parse the data and then assign the data as a hash value, while the rank, team, league, etc. are assigned as hash keys)?
Lastly, I had placed the Nokogiri call in my controllers, but I believe there is a better way. Ryan Bates's Railscasts suggests putting the Nokogiri call into a rake task (/lib/tasks/). Since I want the website to receive the new baseball stats daily, will I have to run the rake task regularly? Secondly, how would I best get the data into the view?
Searching online brought up the idea of putting this into config/initializers, but I'm not sure if that's a better solution.
The following is the Nokogiri call:
task :fetch_mets => :environment do
  require 'nokogiri'
  require 'open-uri'
  url = "http..."
  doc = Nokogiri::HTML(open(url)) # on Ruby 3.0+, use URI.open(url)
  x = Array.new
  doc.css('tr:nth-child(14) td').each do |stat|
    x << stat.content
  end
  a = %w[rank team league games_total games_won games_lost ratio streak]
  o = Hash[a.zip x]
  statistics = Hash[o.map{|(y,z)| [y.to_sym, z]}]
  #order_stat = statistics.each{|key, value| puts "#{key} is #{value}."}
end
Please let me know if I have to clarify anything, thanks so much.

Create a table in your db called statistics and include all the keys in your hash as columns (plus created_on and id). To save your stats do:
Then in your view pull the one with the highest created_on.
For running rake tasks on a cron schedule, take a look at the whenever gem.
Also, it might be cleaner to do it more like:
keys = %w[rank team league games_total games_won games_lost ratio streak].map(&:to_sym)
values = doc.css('tr:nth-child(14) td').map(&:text)
statistics = Hash[keys.zip values]
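To see the zip approach without hitting the network, here is a minimal sketch in plain Ruby. The sample_values array is made-up data standing in for doc.css('tr:nth-child(14) td').map(&:text), and build_statistics is a hypothetical helper name, not part of Nokogiri or Rails.

```ruby
KEYS = %i[rank team league games_total games_won games_lost ratio streak].freeze

# Pairs each key with the value at the same position and returns a hash.
# If there are fewer values than keys, the remaining keys map to nil.
def build_statistics(values)
  KEYS.zip(values).to_h
end

# Made-up cell texts, standing in for the parsed table row.
sample_values = ["14", "Mets", "NL", "162", "77", "85", ".475", "W1"]
statistics = build_statistics(sample_values)

puts statistics[:team]
puts statistics[:games_won]
```

The values stay strings here, as they would coming out of the HTML; converting them (for example with to_i) is left to whatever stores or displays them.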


How to add attribute/property to each record/object in an array? Rails

I'm not sure if this is just something Rails lacks, or if I am searching for all the wrong things here on Stack Overflow, but I cannot find out how to add an attribute to each record in an array.
Here is an example of what I'm trying to do:
@news_stories.each do |individual_news_story|
  @user_for_record = User.where(:id => individual_news_story[:user_id]).pluck('name', 'profile_image_url');
  individual_news_story.attributes(:author_name) = @user_for_record[0][0]
  individual_news_story.attributes(:author_avatar) = @user_for_record[0][1]
end
Any ideas?
If the NewsStory model (or whatever its name is) has a belongs_to relationship to User, then you don't have to do any of this. You can access the attributes of the associated User directly:
@news_stories.each do |news_story|
  news_story.user.name # gives you the name of the associated user
  news_story.user.profile_image_url # same for the avatar
end
To avoid an N+1 query, you can preload the associated user record for every news story at once by using includes in the NewsStory query:
NewsStory.includes(:user)... # rest of the query
If you do this, you won't need the @user_for_record query. Rails will do the heavy lifting for you, and you could even see a performance improvement, thanks to not issuing a separate pluck query for every single news story in the collection.
If you need to have those extra attributes there regardless:
You can select them as extra attributes in your NewsStory query:
where(...) # rest of the query
It looks like you're trying to cache the name and avatar of the user on the NewsStory model, in which case, what you want is this:
@news_stories.each do |individual_news_story|
  user_for_record = User.find(individual_news_story.user_id)
  individual_news_story.author_name = user_for_record.name
  individual_news_story.author_avatar = user_for_record.profile_image_url
end
A couple of notes.
I've used find instead of where. find returns a single record identified by its primary key (id); where returns a collection of records. There are definitely more efficient ways to do this (eager loading, for one), but since you're just starting out, I think it's more important to learn the basics before you dig into the advanced stuff to make things more performant.
I've gotten rid of the pluck call because, again, you're just learning, and pluck is a performance optimization that is useful when you're working with large amounts of data. If that's what you're doing, then ActiveRecord has a batch API you should look into.
I've changed @user_for_record to user_for_record. The @ denotes an instance variable in Ruby. Instance variables are shared and accessible from any instance method in an instance of a class. In this case, all you need is a local variable.
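The idea behind the eager loading mentioned in these answers (fetch the needed users once, then look each one up in memory) can be illustrated in plain Ruby. This is only a sketch: the User and NewsStory structs below are hypothetical stand-ins for the ActiveRecord models, and the sample data is made up.

```ruby
# Structs standing in for the ActiveRecord models (assumption, not Rails code).
User = Struct.new(:id, :name, :profile_image_url)
NewsStory = Struct.new(:id, :user_id, :author_name, :author_avatar)

users = [
  User.new(1, "Ada", "/avatars/ada.png"),
  User.new(2, "Grace", "/avatars/grace.png")
]
news_stories = [
  NewsStory.new(10, 2),
  NewsStory.new(11, 1),
  NewsStory.new(12, 2)
]

# Index the users by id once, instead of doing one lookup query per story.
users_by_id = users.to_h { |user| [user.id, user] }

news_stories.each do |story|
  author = users_by_id.fetch(story.user_id)
  story.author_name = author.name
  story.author_avatar = author.profile_image_url
end

puts news_stories.map(&:author_name).inspect
```

In a real Rails app, includes(:user) does this bookkeeping for you; the sketch only shows why it avoids one query per story.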

Join two hash Tables (dynamoDb) with Ruby on Rails

I am new to Ruby and using it for one project only. I need to join two tables in AWS DynamoDB, basically the equivalent of a SQL left join. But since DynamoDB apparently doesn't support joins, it seems I need to make it happen at the array level.
Currently I am querying the first table just fine, but I need to bring in this other table, and I'm having a heck of a time finding a simple example for Ruby with Rails without using ActiveRecord (to avoid causing an overhaul on pre-existing code).
client = Aws::DynamoDB::Client.new
response = client.scan(table_name: 'db_current')
@items = response.items
Fake output (to protect the innocent), one row from each table:
{"machine_id"=>"pc-123435", "type_id"=>"t-56778"}
{"description"=>"Dell 5 Dev Computer", "Name"=>"Dell", "type_id"=>"t-56778"}
I thought I might have to make two:
client = Aws::DynamoDB::Client.new
db_c = client.scan(table_name: 'db_current')
@c_items = db_c.items
client = Aws::DynamoDB::Client.new
db_t = client.scan(table_name: 'db_type')
@t_items = db_c.joins(db_t['type_id']) # <=== then merge them
where I'll ultimately display description/name/machine_id
But sadly no luck.
I'm looking for suggestions. I'd prefer to keep it simple to really understand (it might sound unreasonable, but I don't want to pull in ActiveRecord just yet unless I'll be owning this project going forward).
I ended up doing it this way. There is probably a more elegant solution for those who are familiar with Ruby, which I am not.
Basically, for each of the items in the first array of hashes (table), I use the ID from that item to find the matching item in the second array of hashes, merging them in the process and then appending the result to a final destination which I'll use for my UI.
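@c_by_id = Array.new
@b_items.each do |item|
  # find (unlike first) uses the block and returns the first matching hash
  pjoin = @c_items.find {|h| h['b_id'] == item['b_id']}
  newjoin = item.merge(pjoin || {}) # keep the item even when nothing matches
  @c_by_id << newjoin
end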
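For larger tables, scanning the second array once per item gets slow. A minimal sketch of the same left join using a lookup hash follows; left_join is a hypothetical helper name, and the rows are made up in the shape of the sample output from the question.

```ruby
# Made-up rows shaped like the scan results shown in the question.
machines = [
  { "machine_id" => "pc-123435", "type_id" => "t-56778" },
  { "machine_id" => "pc-999999", "type_id" => "t-00000" } # no matching type
]
types = [
  { "description" => "Dell 5 Dev Computer", "Name" => "Dell", "type_id" => "t-56778" }
]

# Left join: every row of `left` is kept; fields from the matching row of
# `right` are merged in when one exists for the same key.
def left_join(left, right, key)
  index = right.to_h { |row| [row[key], row] } # one pass to build the lookup
  left.map { |row| row.merge(index.fetch(row[key], {})) }
end

joined = left_join(machines, types, "type_id")
joined.each { |row| puts row.inspect }
```

This assumes the key is unique in the right-hand table; if several rows share a type_id, only the last one ends up in the index.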

Rails: How to resume a rake task?

I think rake task is not the keyword here, but I don't know the correct keyword for this problem.
articles = Article.all
articles.each do |article|
  get_share(article) #use HTTParty, Nokogiri, etc.
  if article.save
    puts "#{article.url}, #{article.share}"
  end
end
I have this script to get the share count of a URL from Facebook, Twitter and other platforms. However, sometimes the loop is interrupted: maybe my internet connection drops, maybe the parsing in Nokogiri goes wrong, or there are simply too many articles.
So, if I run the task again, it will start over from the beginning, which is really a waste of time.
Is it possible to let it pick up where the loop stopped (the specific article in this case), and start the script from there?
I can output article.id, and get the articles like articles = Article.where('id > ?', stopped_id), but is this a good solution? Or is there a more elegant approach?
In order to do this, you're going to have to store, somehow, which articles you've updated. You could look at the updated_at field of the articles table, but that would include articles that have been updated via the normal operation of your site.
A super simple method is just to read/write a temp file, e.g.:
tempfile = "/tmp/updated_article_ids.txt"
if File.exist?(tempfile)
  @updated_ids = File.readlines(tempfile).collect{|l| l.chomp.to_i}
end
if @updated_ids.blank?
  articles = Article.all
else
  articles = Article.where(["id not in (?)", @updated_ids]).all
end
articles.each do |article|
  get_share(article) #use HTTParty, Nokogiri, etc.
  if article.save
    puts "#{article.url}, #{article.share}"
    # write to the file handle; a bare puts would print to stdout instead
    File.open(tempfile, "a"){|f| f.puts article.id}
  end
end
If you know that you want to start from scratch, delete the tempfile. Or, you could have a further test in the code to only use tempfile if it's less than a day old or something.
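The temp-file idea can be tried outside Rails. Below is a minimal sketch under stated assumptions: each_unprocessed is a hypothetical helper, plain integers stand in for article ids, and the "network down" error simulates an interrupted run.

```ruby
require "set"
require "tmpdir"

# Yields every id not yet listed in checkpoint_path, and appends an id to
# the file only after its block has finished without raising.
def each_unprocessed(ids, checkpoint_path)
  done = if File.exist?(checkpoint_path)
           File.readlines(checkpoint_path, chomp: true).map(&:to_i).to_set
         else
           Set.new
         end

  File.open(checkpoint_path, "a") do |log|
    ids.each do |id|
      next if done.include?(id)
      yield id
      log.puts(id)
      log.flush # get the id onto disk before moving to the next one
    end
  end
end

checkpoint = File.join(Dir.mktmpdir, "updated_article_ids.txt")
seen = []

# First run: the simulated work fails on id 3, so only 1 and 2 are recorded.
begin
  each_unprocessed([1, 2, 3, 4], checkpoint) do |id|
    raise "network down" if id == 3
    seen << id
  end
rescue RuntimeError
  # interrupted; the checkpoint file keeps what already finished
end

# Second run: resumes at id 3 instead of starting over.
each_unprocessed([1, 2, 3, 4], checkpoint) { |id| seen << id }
puts seen.inspect
```

Deleting the checkpoint file gives a fresh start, as described above.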
I think it's best to implement such tasks using a dedicated background job tool. I personally like Delayed Job.
If you're not keen on doing something like that, you can always rescue the exception and do logic around that - either save the id as you mentioned, or do some sort of a sleep-retry logic.
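A sleep-retry wrapper of the kind mentioned here could look roughly like this. It is a sketch, not a library API: with_retries and its parameters are hypothetical names, and the flaky block only simulates a failing fetch.

```ruby
# Runs the block, retrying up to `attempts` times in total when one of the
# listed exception classes is raised, and sleeping a little longer each time.
def with_retries(attempts: 3, base_delay: 1.0, on: [StandardError])
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *on
    raise if tries >= attempts # give up and re-raise the last error
    sleep(base_delay * tries)  # linear backoff between tries
    retry
  end
end

calls = 0
result = with_retries(attempts: 3, base_delay: 0.01) do
  calls += 1
  raise IOError, "flaky connection" if calls < 3 # fails twice, then works
  "shares fetched"
end

puts "#{result} after #{calls} calls"
```

Inside the rake task, the body of the articles loop would go in the block, so one bad request does not abort the whole run.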

Speeding up XML to MySQL with Nokogiri in Rails

I'm writing large amounts of data from XML feeds to my MySQL database in my Rails 3 app using Nokogiri. Everything is working fine but it's slower than I would like.
Is there any way to speed up the process? This is simplified version of the script I'm using:
url = "http://example.com/urltoxml"
doc = Nokogiri::XML(open(url))
doc.xpath("//item").each do |record|
  guid = record.xpath("id").inner_text
  price = record.xpath("price").inner_text
  shipping = record.xpath("shipping").inner_text
  data = Table.new(
    :guid => guid,
    :price => price,
    :shipping => shipping
  )
  if price != ""
    data.save
  end
end
Thanks in advance.
I guess your problem does not come from parsing the XML, but from inserting the records one by one into the DB, which is very costly.
Unfortunately, AFAIK Rails does not provide a native way to mass-insert records. There once was a gem that did it, but I can't find it again.
"Mass inserting data in Rails without killing your performance", though, provides helpful insights on how to do it manually.
If you go this way, don't forget to process your nodes in batches if you don't want to end with a single 999-bazillion-rows INSERT statement.
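As a rough illustration of the batching advice, the sketch below groups rows with each_slice and builds one parameterized multi-row INSERT per batch. Everything here is an assumption for illustration: the Row struct, the batched_inserts helper and the items table name are hypothetical, and actually executing the statements depends on your database driver.

```ruby
Row = Struct.new(:guid, :price, :shipping)

# Returns an array of [sql, bind_values] pairs, one per batch.
# Rows with an empty price are skipped, as in the original loop.
def batched_inserts(rows, batch_size:, table: "items")
  rows.reject { |row| row.price.to_s.empty? }
      .each_slice(batch_size)
      .map do |batch|
        placeholders = Array.new(batch.size, "(?, ?, ?)").join(", ")
        sql = "INSERT INTO #{table} (guid, price, shipping) VALUES #{placeholders}"
        [sql, batch.flat_map(&:to_a)]
      end
end

# Made-up rows standing in for the parsed <item> nodes.
rows = [
  Row.new("a1", "9.99", "1.50"),
  Row.new("a2", "", "2.00"), # skipped: empty price
  Row.new("a3", "4.25", "0.00"),
  Row.new("a4", "7.00", "3.10")
]

statements = batched_inserts(rows, batch_size: 2)
statements.each { |sql, binds| puts "#{sql} <- #{binds.inspect}" }
```

Keeping the values as bind parameters rather than interpolating them into the SQL string avoids quoting problems with feed data.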

Working with a large data object between ruby processes

I have a Ruby hash that reaches approximately 10 megabytes if written to a file using Marshal.dump. After gzip compression it is approximately 500 kilobytes.
Iterating through and altering this hash is very fast in ruby (fractions of a millisecond). Even copying it is extremely fast.
The problem is that I need to share the data in this hash between Ruby on Rails processes. In order to do this using the Rails cache (file_store or memcached) I need to Marshal.dump the hash first; however, this incurs a 1000 millisecond delay when serializing it and a 400 millisecond delay when deserializing it.
Ideally I would want to be able to save and load this hash from each process in under 100 milliseconds.
One idea is to spawn a new Ruby process to hold this hash that provides an API to the other processes to modify or process the data within it, but I want to avoid doing this unless I'm certain that there are no other ways to share this object quickly.
Is there a way I can more directly share this hash between processes without needing to serialize or deserialize it?
Here is the code I'm using to generate a hash similar to the one I'm working with:
@a = []
0.upto(500) do |r|
  @a[r] = []
  0.upto(10_000) do |c|
    if rand(10) == 0
      @a[r][c] = 1 # 10% chance of being 1
    else
      @a[r][c] = 0
    end
  end
end
@c = Marshal.dump(@a) # 1000 milliseconds
Marshal.load(@c) # 400 milliseconds
Since my original question did not receive many responses, I'm assuming there's no solution as easy as I would have hoped.
Presently I'm considering two options:
1. Create a Sinatra application to store this hash with an API to modify/access it.
2. Create a C application to do the same as #1, but a lot faster.
The scope of my problem has increased such that the hash may be larger than my original example. So #2 may be necessary. But I have no idea where to start in terms of writing a C application that exposes an appropriate API.
A good walkthrough of how best to implement #1 or #2 may receive best answer credit.
Update 2
I ended up implementing this as a separate application written in Ruby 1.9 that has a DRb interface to communicate with application instances. I use the Daemons gem to spawn DRb instances when the web server starts up. On start up the DRb application loads in the necessary data from the database, and then it communicates with the client to return results and to stay up to date. It's running quite well in production now. Thanks for the help!
A Sinatra app will work, but the {un}serializing and the HTML parsing could impact performance compared to a DRb service.
Here's an example, based on your example in the related question. I'm using a hash instead of an array so you can use user ids as indexes. This way there is no need to keep both a table of interests and a table of user ids on the server. Note that the interest table is "transposed" compared to your example, which is the way you want it anyway, so it can be updated in one call.
# server.rb
require 'drb'
class InterestServer < Hash
  include DRbUndumped # don't send the data over!
  def closest(cur_user_id)
    cur_interests = fetch(cur_user_id)
    selected_interests = cur_interests.each_index.select{|i| cur_interests[i]}
    scores = map do |user_id, interests|
      nb_match = selected_interests.count{|i| interests[i] }
      [nb_match, user_id]
    end
    scores.sort! # ascending, so the best matches come last
  end
end
DRb.start_service nil, InterestServer.new
puts DRb.uri
DRb.thread.join # keep the server process alive
# client.rb
uri = ARGV.shift
require 'drb'
interest_server = DRbObject.new nil, uri
USERS_COUNT = 10_000
INTERESTS_COUNT = 500 # assumed value; the definition is missing from the snippet
# Mock users
users = Array.new(USERS_COUNT) { {:id => rand(100000)+100000} }
# Initial send over user interests
users.each do |user|
  interest_server[user[:id]] = Array.new(INTERESTS_COUNT) { rand(10) == 0 }
end
# query at will
puts interest_server.closest(users.first[:id]).inspect
# update, say there's a new user:
new_user = {:id => 42}
users << new_user
# This guy is interested in everything!
interest_server[new_user[:id]] = Array.new(INTERESTS_COUNT) { true }
puts interest_server.closest(users.first[:id])[-2,2].inspect
# Will output our first user and this new user which both match perfectly
To run in terminal, start the server and give the output as the argument to the client:
$ ruby server.rb
$ ruby client.rb druby://mal.lan:51630
[[0, 100035], ...]
[[45, 42], [45, 178902]]
Maybe it's too obvious, but if you sacrifice a little access speed to the members of your hash, a traditional database will give you much more constant time access to values. You could start there and then add caching to see if you could get enough speed from it. This will be a little simpler than using Sinatra or some other tool.
Be careful with memcache; it has an object size limit (1 MB by default).
One thing to try is to use MongoDB as your storage. It is pretty fast and you can map pretty much any data structure into it.
If it's sensible to wrap your monster hash in a method call, you might simply present it using DRb - start a small daemon that starts a DRb server with the hash as the front object - other processes can make queries of it using what amounts to RPC.
More to the point, is there another approach to your problem? Without knowing what you're trying to do, it's hard to say for sure - but maybe a trie, or a Bloom filter would work? Or even a nicely interfaced bitfield would probably save you a fair amount of space.
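To make the bitfield suggestion concrete, here is a minimal sketch that packs the 0/1 grid from the question into a binary string, one bit per cell. BitGrid is a hypothetical class written for this illustration, not an existing library.

```ruby
# Stores a rows x cols grid of 0/1 flags in a binary String, one bit per cell.
class BitGrid
  attr_reader :rows, :cols

  def initialize(rows, cols)
    @rows = rows
    @cols = cols
    @bytes = "\x00".b * ((rows * cols + 7) / 8) # round up to whole bytes
  end

  # Returns 0 or 1 for the cell at row r, column c.
  def [](r, c)
    byte, bit = locate(r, c)
    (@bytes.getbyte(byte) >> bit) & 1
  end

  # Sets the cell to 1 when value is 1, otherwise clears it.
  def []=(r, c, value)
    byte, bit = locate(r, c)
    current = @bytes.getbyte(byte)
    updated = value == 1 ? current | (1 << bit) : current & ~(1 << bit)
    @bytes.setbyte(byte, updated)
  end

  def bytesize
    @bytes.bytesize
  end

  private

  # Maps a cell to its byte offset and the bit position within that byte.
  def locate(r, c)
    (r * @cols + c).divmod(8)
  end
end

grid = BitGrid.new(501, 10_001) # same dimensions as 0.upto(500) x 0.upto(10_000)
grid[3, 7] = 1
grid[500, 10_000] = 1
puts grid.bytesize
puts Marshal.load(Marshal.dump(grid))[3, 7]
```

At one bit per cell the whole grid fits in well under a megabyte, so Marshal has a single string to copy instead of millions of small integers. Whether that is fast enough for the 100 millisecond target is something to measure rather than assume.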
Have you considered upping the memcache max object size?
Versions greater than 1.4.2
memcached -I 11m #giving yourself an extra MB in space
or on previous versions changing the value of POWER_BLOCK in the slabs.c and recompiling.
What about storing the data in Memcache instead of storing the Hash in Memcache? Using your code above:
@a = []
0.upto(500) do |r|
  @a[r] = []
  0.upto(10_000) do |c|
    key = "#{r}:#{c}"
    if rand(10) == 0
      Cache.set(key, 1) # 10% chance of being 1
    else
      Cache.set(key, 0)
    end
  end
end
This will be speedy, you won't have to worry about serialization, and all of your systems will have access to it. I asked in a comment on the main post about accessing the data; you will have to get creative, but it should be easy to do.
