Speeding up XML to MySQL with Nokogiri in Rails

I'm writing large amounts of data from XML feeds to my MySQL database in my Rails 3 app using Nokogiri. Everything is working fine, but it's slower than I would like.
Is there any way to speed up the process? This is a simplified version of the script I'm using:
require 'nokogiri'
require 'open-uri'

url = "http://example.com/urltoxml"
doc = Nokogiri::XML(open(url))

doc.xpath("//item").each do |record|
  guid     = record.xpath("id").inner_text
  price    = record.xpath("price").inner_text
  shipping = record.xpath("shipping").inner_text

  data = Table.new(
    :guid     => guid,
    :price    => price,
    :shipping => shipping
  )
  data.save if price != ""
end
Thanks in advance

I guess your problem is not the XML parsing, but that you insert the records one by one into the DB, which is very costly.
Unfortunately, AFAIK Rails does not provide a native way to mass-insert records. There once was a gem that did it, but I can't track it down anymore.
"Mass inserting data in Rails without killing your performance", though, provides helpful insights on how to do it manually.
If you go this way, don't forget to process your nodes in batches if you don't want to end up with a single 999-bazillion-row INSERT statement.
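For illustration, here is a minimal sketch of that manual approach, assuming the Table model from the question maps to a tables table with guid, price and shipping columns (the table name and the batch size of 500 are assumptions):
rows = doc.xpath("//item").map do |record|
  [record.xpath("id").inner_text,
   record.xpath("price").inner_text,
   record.xpath("shipping").inner_text]
end
rows.reject! { |(_, price, _)| price == "" }

conn = ActiveRecord::Base.connection
rows.each_slice(500) do |batch|
  # One multi-row INSERT per batch instead of one INSERT per record.
  values = batch.map { |row| "(#{row.map { |v| conn.quote(v) }.join(', ')})" }.join(", ")
  conn.execute("INSERT INTO tables (guid, price, shipping) VALUES #{values}")
end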

Related

Ruby: Hash: use one record attribute as key and another as value

Let's say I have a User with attributes name and badge_number
For a JavaScript autocomplete field I want the user to be able to start typing the user's name and get a select list.
I'm using Materialize which offers the JS needed, I just need to provide it the data in this format:
data: { "Sarah Person": 13241, "Billiam Gregory": 54665, "Stephan Stevenston": 98332 }
This won't do:
User.select(:name, :badge_number) # => [{ name: "Sarah Person", badge_number: 13241 }, ...]
And this feels repetitive, icky and redundant (and repetitive):
user_list = User.select(:name, :badge_number)
hsh = {}
user_list.each do |user|
  hsh[user.name] = user.badge_number
end
hsh
...though it does give me my intended result, performance will suck over time.
Any better ways than this weird, slimy loop?
This will give the desired output:
User.pluck(:name, :badge_number).to_h
Edit
Though the above code is a one-liner, it still loops internally. Offloading such loops to the database may improve performance when dealing with very many rows, but there is no database-agnostic way to achieve this in ActiveRecord. Follow the answer below for achieving this in Postgres.
If your RDBMS is Postgresql, you can use Postgresql function json_build_object for this specific case.
User.select("json_build_object(name, badge_number) as json_col")
.map(&:json_col)
The whole JSON can be built using Postgres-supplied functions too.
User.select("array_to_json(array_agg(json_build_object(name, badge_number))) as json_col")
.limit(1)[0]
.json_col

Join two hash Tables (dynamoDb) with Ruby on Rails

I am new to Ruby, for one project only - I need to join two tables with AWS DynamoDB. Basically the equivalent of a SQL left join. But since DynamoDB apparently doesn't support joins, I need to make it happen at the array level, it seems.
Currently I am querying the one table just fine, but I need to bring in the other table, and I'm having a heck of a time finding a simple example for Ruby with Rails without using ActiveRecord (to avoid causing an overhaul of pre-existing code).
client = Aws::DynamoDB::Client.new
response = client.scan(table_name: 'db_current')
@items = response.items
Fake output to protect the innocent:
db_current:
{"machine_id"=>"pc-123435", "type_id"=>"t-56778"}
db_type:
{"description"=>"Dell 5 Dev Computer", "Name"=>"Dell", "type_id"=>"t-56778"}
I thought I might have to make two scans:
client = Aws::DynamoDB::Client.new
db_c = client.scan(table_name: 'db_current')
@c_items = db_c.items

client = Aws::DynamoDB::Client.new
db_t = client.scan(table_name: 'db_type')
@t_items = db_c.joins(db_t['type_id']) # <=== then merge them here, where I'll
                                       #      ultimately display description/name/machine_id
But sadly no luck.
I'm looking for suggestions. I'd prefer to keep it simple to really
understand (It might sound unreasonable, I don't want to pull in ActiveRecord just yet unless I'll be owning this project going forward).
I ended up doing it this way. There is probably a more elegant solution for those that are familiar with Ruby... which I am not.
Basically, for each of the items in the first hash array (table), I use the ID from that one to find the matching item in the second hash array, merging them in the process, then appending the result to a final array which I'll use for my UI.
@c_by_id = Array.new
@b_items.each do |item|
  # Array#first ignores a block, so find is needed to filter on b_id.
  pjoin = @c_items.find { |h| h['b_id'] == item['b_id'] }
  newjoin = item.merge(pjoin)
  @c_by_id.append(newjoin)
end
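For comparison, here is a hedged sketch of the same left join that indexes the second table once in a lookup hash instead of rescanning it for every item (table and column names taken from the question's sample output):
client = Aws::DynamoDB::Client.new
machines = client.scan(table_name: 'db_current').items
types    = client.scan(table_name: 'db_type').items

# Index the "right" table once, keyed by the join column.
types_by_id = types.each_with_object({}) { |t, h| h[t['type_id']] = t }

# Left join: keep every machine, merging in the type row when one matches.
joined = machines.map { |m| m.merge(types_by_id[m['type_id']] || {}) }
joined.each { |row| puts [row['description'], row['Name'], row['machine_id']].join(' | ') }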

Batch insertion in rails 3

I want to do a batch insert of a few thousand records into the database (Postgres in my case) from within my Rails app.
What would be the "Rails way" of doing it?
Something which is fast, and also the correct way of doing it.
I know I can build the SQL query by string concatenation of the attributes, but I want a better approach.
The ActiveRecord .create method supports bulk creation. The method emulates the feature if the DB doesn't support it, and uses the underlying DB engine if the feature is supported.
Just pass an array of attribute hashes.
# Create an Array of new objects
User.create([{ :first_name => 'Jamie' }, { :first_name => 'Jeremy' }])
A block is supported too, and it's the common way to set shared attributes.
# Creating an Array of new objects using a block, where the block is executed for each object:
User.create([{ :first_name => 'Jamie' }, { :first_name => 'Jeremy' }]) do |u|
  u.is_admin = false
end
I finally reached a solution after the two answers of @Simone Carletti and @Sumit Munot.
Until the Postgres driver supports bulk insertion through the ActiveRecord .create method, I would like to go with the activerecord-import gem. It does a bulk insert, and in a single INSERT statement at that.
books = []
10.times do |i|
  books << Book.new(:name => "book #{i}")
end
Book.import books
In Postgres it leads to a single INSERT statement.
Once the Postgres driver supports the ActiveRecord .create method's bulk insertion in a single INSERT statement, then @Simone Carletti's solution makes more sense :)
You can create a method in your Rails model, write your insert queries in that method, and run it with:
rails runner MyModelName.my_method_name
It's the best way that I used in my project.
Update:
I use the following in my project, but it is not protected against SQL injection. If you are not using user input in this query, it may work for you:
user_string = " ('a@ao.in','a'), ('b@ao.in','b')"
User.connection.insert("INSERT INTO users (email, name) VALUES" + user_string)
For Multiple records:
new_records = [
  {:column => 'value', :column2 => 'value'},
  {:column => 'value', :column2 => 'value'}
]
MyModel.create(new_records)
You can do it the fast way or the Rails way ;) The best way in my experience to import bulk data into Postgres is via CSV. What takes several minutes the Rails way will take several seconds using Postgres' native CSV import capability.
http://www.postgresql.org/docs/9.2/static/sql-copy.html
It even triggers database triggers and respects database constraints.
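As a hedged sketch of how that might look from Ruby - assuming the postgres adapter, so that raw_connection returns the pg gem's PG::Connection, and a books table with a name column as in the import example above:
require 'csv'

conn = ActiveRecord::Base.connection.raw_connection

# Stream CSV rows straight into Postgres' COPY; one round trip, no per-row INSERTs.
conn.copy_data("COPY books (name) FROM STDIN WITH (FORMAT csv)") do
  10_000.times do |i|
    conn.put_copy_data(CSV.generate_line(["book #{i}"]))
  end
end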
Edit (after your comment):
Gotcha. In that case you have correctly described your two options. I have been in the same situation before: I implemented it using the Rails 1000-save! strategy because it was the simplest thing that worked, and then optimized it to the 'append a huge query string' strategy because it performed an order of magnitude better.
Of course, premature optimization is the root of all evil, so perhaps do it the simple, slow Rails way, and know that building a big query string is a perfectly legit optimization technique at the expense of maintainability. I feel your real question is 'is there a Railsy way that doesn't involve thousands of queries?' - unfortunately, the answer to that is no.

Rails: Nokogiri issue, where to place code

I have a question regarding applying Nokogiri into my Rails app. I'm trying to collect baseball stats from a website and display the data into the view. I'm successful in parsing the data, however, I am not sure where to store the code in a RESTful manner.
Currently, I'm collecting the stats, putting them in an array and then matching them with another array (by rank, team, league, etc.). The two arrays are then put into a hash. Is there a more efficient way to do this (as in parse the data and then assign the data as a hash value, while the rank, team, league, etc. are assigned as hash keys)?
Lastly, I had placed the Nokogiri call in my controllers, but I believe there is a better way. Ryan Bates's Railscasts suggest putting the Nokogiri call into a rake task (/lib/tasks/). Since I want the website to receive new baseball stats daily, will I have to run the rake task regularly? Secondly, how would I best get the data into the view?
Searching online brought the idea of putting this into a config/initializers, but I'm not sure if that's a better solution.
The following is the Nokogiri call:
task :fetch_mets => :environment do
  require 'nokogiri'
  require 'open-uri'

  url = "http..."
  doc = Nokogiri::HTML(open(url))

  x = Array.new
  doc.css('tr:nth-child(14) td').each do |stat|
    x << stat.content
  end

  a = %w[rank team league games_total games_won games_lost ratio streak]
  o = Hash[a.zip x]
  statistics = Hash[o.map { |(y, z)| [y.to_sym, z] }]
  @order_stat = statistics.each { |key, value| puts "#{key} is #{value}." }
end
Please let me know if I have to clarify anything, thanks so much.
Create a table in your db called statistics and include all the keys in your hash (plus created_on and id). To save your stats, do:
Statistic.new(statistics).save
Then, in your view, pull the one with the highest created_on.
For running rake tasks on a cron schedule take a look at whenever.
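For example, a sketch of whenever's config/schedule.rb DSL that runs the task daily (the task name comes from the question; the time is arbitrary):
# config/schedule.rb
every 1.day, :at => '4:30 am' do
  rake "fetch_mets"
end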
Also, it might be cleaner to do it more like:
keys = %w[rank team league games_total games_won games_lost ratio streak].map(&:to_sym)
values = doc.css('tr:nth-child(14) td').map(&:text)
statistics = Hash[keys.zip values]

Working with a large data object between ruby processes

I have a Ruby hash that reaches approximately 10 megabytes if written to a file using Marshal.dump. After gzip compression it is approximately 500 kilobytes.
Iterating through and altering this hash is very fast in ruby (fractions of a millisecond). Even copying it is extremely fast.
The problem is that I need to share the data in this hash between Ruby on Rails processes. In order to do this using the Rails cache (file_store or memcached) I need to Marshal.dump the hash first; however, this incurs a 1000 millisecond delay when serializing it and a 400 millisecond delay when deserializing it.
Ideally I would want to be able to save and load this hash from each process in under 100 milliseconds.
One idea is to spawn a new Ruby process to hold this hash that provides an API to the other processes to modify or process the data within it, but I want to avoid doing this unless I'm certain that there are no other ways to share this object quickly.
Is there a way I can more directly share this hash between processes without needing to serialize or deserialize it?
Here is the code I'm using to generate a hash similar to the one I'm working with:
@a = []
0.upto(500) do |r|
  @a[r] = []
  0.upto(10_000) do |c|
    if rand(10) == 0
      @a[r][c] = 1 # 10% chance of being 1
    else
      @a[r][c] = 0
    end
  end
end

@c = Marshal.dump(@a)  # 1000 milliseconds
Marshal.load(@c)       # 400 milliseconds
Update:
Since my original question did not receive many responses, I'm assuming there's no solution as easy as I would have hoped.
Presently I'm considering two options:
1. Create a Sinatra application to store this hash, with an API to modify/access it.
2. Create a C application to do the same as #1, but a lot faster.
The scope of my problem has increased such that the hash may be larger than my original example, so #2 may be necessary. But I have no idea where to start in terms of writing a C application that exposes an appropriate API.
A good walkthrough of how best to implement #1 or #2 may receive best answer credit.
Update 2
I ended up implementing this as a separate application written in Ruby 1.9 that has a DRb interface to communicate with application instances. I use the Daemons gem to spawn DRb instances when the web server starts up. On start up the DRb application loads in the necessary data from the database, and then it communicates with the client to return results and to stay up to date. It's running quite well in production now. Thanks for the help!
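For reference, a hypothetical sketch (not the poster's actual code) of how the Daemons gem can wrap a DRb server so it spawns when the web server starts; InterestServer stands in for whatever front object holds the data, as in the answer below, and the port is arbitrary:
# interest_daemon.rb
require 'daemons'

Daemons.run_proc('interest_server') do
  require 'drb'
  # Load the hash from the database here, then serve it over DRb.
  DRb.start_service 'druby://localhost:9999', InterestServer.new
  DRb.thread.join
end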
A Sinatra app will work, but the {un}serializing and the HTTP parsing could impact performance compared to a DRb service.
Here's an example, based on your example in the related question. I'm using a hash instead of an array so you can use user ids as indexes. This way there is no need to keep both a table of interests and a table of user ids on the server. Note that the interest table is "transposed" compared to your example, which is the way you want it anyways, so it can be updated in one call.
# server.rb
require 'drb'

class InterestServer < Hash
  include DRbUndumped # don't send the data over!

  def closest(cur_user_id)
    cur_interests = fetch(cur_user_id)
    selected_interests = cur_interests.each_index.select { |i| cur_interests[i] }
    scores = map do |user_id, interests|
      nb_match = selected_interests.count { |i| interests[i] }
      [nb_match, user_id]
    end
    scores.sort!
  end
end

DRb.start_service nil, InterestServer.new
puts DRb.uri
DRb.thread.join
# client.rb
require 'drb'

uri = ARGV.shift
DRb.start_service
interest_server = DRbObject.new nil, uri

USERS_COUNT = 10_000
INTERESTS_COUNT = 500

# Mock users
users = Array.new(USERS_COUNT) { {:id => rand(100000) + 100000} }

# Initial send-over of user interests
users.each do |user|
  interest_server[user[:id]] = Array.new(INTERESTS_COUNT) { rand(10) == 0 }
end

# Query at will
puts interest_server.closest(users.first[:id]).inspect

# Update, say there's a new user:
new_user = {:id => 42}
users << new_user
# This guy is interested in everything!
interest_server[new_user[:id]] = Array.new(INTERESTS_COUNT) { true }

puts interest_server.closest(users.first[:id])[-2, 2].inspect
# Will output our first user and this new user, which both match perfectly
To run in terminal, start the server and give the output as the argument to the client:
$ ruby server.rb
druby://mal.lan:51630
$ ruby client.rb druby://mal.lan:51630
[[0, 100035], ...]
[[45, 42], [45, 178902]]
Maybe it's too obvious, but if you sacrifice a little access speed to the members of your hash, a traditional database will give you much more constant-time access to values. You could start there and then add caching to see if you can get enough speed from it. This will be a little simpler than using Sinatra or some other tool.
Be careful with memcache: it has object size limitations (1 MB by default).
One thing to try is to use MongoDB as your storage. It is pretty fast and you can map pretty much any data structure into it.
If it's sensible to wrap your monster hash in a method call, you might simply present it using DRb - start a small daemon that starts a DRb server with the hash as the front object - other processes can make queries of it using what amounts to RPC.
More to the point, is there another approach to your problem? Without knowing what you're trying to do, it's hard to say for sure - but maybe a trie, or a Bloom filter would work? Or even a nicely interfaced bitfield would probably save you a fair amount of space.
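To illustrate the bitfield idea, here is a sketch in plain Ruby (no particular gem assumed) that packs a row of 0/1 interest flags into a single Integer, cutting each 10,000-element row down to roughly 1.25 KB:
def pack_bits(flags)
  # Set bit i for every flag that equals 1.
  flags.each_with_index.inject(0) do |acc, (flag, i)|
    flag == 1 ? acc | (1 << i) : acc
  end
end

bits = pack_bits([1, 0, 0, 1])
bits[0] # => 1 (Integer#[] reads the i-th bit)
bits[1] # => 0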
Have you considered upping the memcache max object size?
On versions greater than 1.4.2:
memcached -I 11m # give yourself extra space (an 11 MB max item size)
On previous versions, change the value of POWER_BLOCK in slabs.c and recompile.
What about storing the individual values in Memcache instead of storing the whole Hash in Memcache? Using your code above:
@a = []
0.upto(500) do |r|
  @a[r] = []
  0.upto(10_000) do |c|
    key = "#{r}:#{c}"
    if rand(10) == 0
      Cache.set(key, 1) # 10% chance of being 1
    else
      Cache.set(key, 0)
    end
  end
end
This will be speedy, you won't have to worry about serialization, and all of your systems will have access to it. I asked in a comment on the main post about how you will access the data; you will have to get creative, but it should be easy to do.
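For instance, reading a whole row back could look like the sketch below; Cache is the same hypothetical client wrapper as above, and get_multi follows the common memcache client API (an assumption - check your client):
r = 0 # the row to fetch
keys = (0..10_000).map { |c| "#{r}:#{c}" }
row = Cache.get_multi(*keys) # => { "0:0" => 0, "0:1" => 1, ... }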
