High-performance RSS/Atom parsing with Ruby on Rails

I need to parse thousands of feeds and performance is an essential requirement. Do you have any suggestions?
Thanks in advance!

I haven't tried it, but I read about Feedzirra recently (it claims to be built for performance):
Feedzirra is a feed library that is designed to get and update many feeds as quickly as possible. This includes using libcurl-multi through the taf2-curb gem for faster http gets, and libxml through nokogiri and sax-machine for faster parsing.
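I haven't verified this against the current API, but going by Feedzirra's README, fetching a batch of feeds in one go looks roughly like this (the feed URLs are just examples):
require 'rubygems'
require 'feedzirra'

# Example feed URLs -- substitute your own list of thousands
urls = [
  "http://feeds.feedburner.com/PaulDixExplainsNothing",
  "http://feeds.feedburner.com/engadget"
]

# fetch_and_parse accepts an array and fetches the feeds in parallel via
# curb/libcurl-multi, returning a hash keyed by URL
feeds = Feedzirra::Feed.fetch_and_parse(urls)

feeds.each do |url, feed|
  puts "#{feed.title}: #{feed.entries.size} entries"
end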

You can use RFeedParser, a Ruby port of the (famous) Python Universal Feed Parser. It's based on Hpricot, and it's really fast and easy to use.
http://rfeedparser.rubyforge.org/
An example:
require 'rubygems'
require 'rfeedparser'
require 'open-uri'
feed = FeedParser::parse(open('http://feeds.feedburner.com/engadget'))
feed.entries.each do |entry|
  puts entry.title
end

When all you have is a hammer, everything looks like a nail. Consider a solution other than Ruby for this. Though I love Ruby and Rails and would not part with them for web development, or perhaps for a domain-specific language, I prefer that heavy data lifting of the type you describe be performed in Java, or perhaps Python or even C++.
Given that the destination of this parsed data is likely a database, it can act as the common point between the Rails portion of your solution and the other-language portion. Then you're using the best tool to solve each of your problems, and the result is likely easier to work on and truly meets your requirements.
If speed is truly of the essence, why add an additional constraint and say, "Oh, it's only of the essence as long as I get to use Ruby"?

Not sure about the performance, but a similar question was answered at Parsing Atom & RSS in Ruby/Rails?
You might also look into Hpricot, which parses XML but assumes that it's well-formed and doesn't do any validation.
http://wiki.github.com/why/hpricot
http://wiki.github.com/why/hpricot/hpricot-xml
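I haven't benchmarked Hpricot against the others, but basic feed parsing with it looks something like this (the feed URL is just an example, and you'd search for :entry instead of :item for Atom feeds):
require 'rubygems'
require 'hpricot'
require 'open-uri'

# Hpricot.XML parses without HTML fix-ups; (doc/:item) is shorthand for searching elements
doc = Hpricot.XML(open('http://feeds.feedburner.com/engadget'))
(doc/:item).each do |item|
  puts (item/:title).inner_text
end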

Initially I used Nokogiri to do some basic XML parsing, but it was slow and erratic at times. I switched to Feedzirra and not only was there a great performance boost, there were also no errors, and it's as easy as pie. Example shown below:
# fetching a single feed
feed = Feedzirra::Feed.fetch_and_parse("http://feeds.feedburner.com/PaulDixExplainsNothing")
# feed and entries accessors
feed.title # => "Paul Dix Explains Nothing"
feed.url # => "http://www.pauldix.net"
feed.feed_url # => "http://feeds.feedburner.com/PaulDixExplainsNothing"
feed.etag # => "GunxqnEP4NeYhrqq9TyVKTuDnh0"
feed.last_modified # => Sat Jan 31 17:58:16 -0500 2009 # it's a Time object
entry = feed.entries.first
entry.title # => "Ruby Http Client Library Performance"
entry.url # => "http://www.pauldix.net/2009/01/ruby-http-client-library-performance.html"
entry.author # => "Paul Dix"
entry.summary # => "..."
entry.content # => "..."
entry.published # => Thu Jan 29 17:00:19 UTC 2009 # it's a Time object
entry.categories # => ["...", "..."]
If you want to do more with the feeds, for example parsing them further, the following will suffice:
source = Feedzirra::Feed.fetch_and_parse("http://www.feed-url-you-want-to-play-with.com")
puts "Parsing Downloaded XML....\n\n\n"
source.entries.each do |entry|
  begin
    puts "#{entry.summary} \n\n"
    cleanURL = (entry.url).gsub("+", "%2B") # my own sanitization process, ignore
    scrapArticleWithURL(cleanURL)
  rescue
    puts "(****)there has been an error fetching (#{entry.title}) \n\n"
  end
end

Related

Ruby currency exchange gems that work?

This is based on an earlier question that was resolved. I need to load sale prices for my ruby-based app in different currencies. I was recently using the gem google_currency to convert the prices based on the Google API. At some point recently it stopped working and I have no idea why. I have tried testing in various ways but can't work out what the problem is.
I am now trying to use the 'exchange' gem, which has good documentation; however, the method I am using is not producing anything in the view files when running.
According to the exchange gem the simple conversion should be something like:
def exchange4
puts 10.in(:eur).to(:usd)
end
However, it is not loading anything in the HTML view. Any suggestions, including other working gems, are welcome!
Currently this code seems like it would pass; however, Action Controller is now telling me it doesn't know the conversion rates:
def exchange4(goods)
require 'money'
require 'money-rails'
exr = Money.new(1, goods.currency).exchange_to(buyer.currency)
puts exr
end
The error Action Controller is giving is:
No conversion rate known for 'GBP' -> 'EUR'
Very strange..
The RubyMoney organization has very good options for dealing with currencies, money and exchange. I use money and it really works. For Rails integration they have money-rails.
Examples of exchange:
Money.us_dollar(100).exchange_to('EUR')
Money.new(100, 'USD').exchange_to('EUR')
You can use eu_central_bank gem (compatible with money) to extract all exchange rates. Example usage (in rails console):
>> bank = EuCentralBank.new
>> bank.update_rates # if bank.last_updated.blank? || bank.last_updated < 1.day.ago
>> Money.default_bank = bank
Then:
>> Money.new(1, 'GBP').exchange_to('EUR')
=> #<Money fractional:1 currency:EUR>
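On the "No conversion rate known for 'GBP' -> 'EUR'" error: the money gem's default bank starts out with no rates, so you have to either load them (as with eu_central_bank above) or add them yourself. A minimal sketch -- the rate value here is made up, so don't use it for real conversions:
require 'money'
# Registers a GBP -> EUR rate on the default VariableExchange bank (illustrative rate only)
Money.add_rate('GBP', 'EUR', 1.16)
Money.new(100, 'GBP').exchange_to('EUR') # => #<Money fractional:116 currency:EUR>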

Getting started with Stanford Parser in JRuby

I'm looking to add some text parsing in my rails app, and have been going in circles for the past few days looking for any tutorials or hints as to how to get this working.
I am completely new to Java, but nothing like jumping in with both feet.
I suspect the following code doesn't belong in my controller, and should likely be in a model, but I'm just seeing if I've got all the pieces in the right place at this point.
I borrowed this code from this SO question, implementing custom java class in jruby, because I was having trouble finding any sort of example code.
#my requires/imports/includes, included multiple versions to be safe
require 'java'
#include Java
require '/media/sf_Ruby192/java_progs/parser/stanford-parser.jar'
#require '/media/sf_Ruby192/java_progs/parser/'
require 'rubygems'
include_class 'edu.stanford.nlp.parser.lexparser.LexicalizedParser'
class ParseController < ApplicationController
  def index
    lp = LexicalizedParser.new
    # check if regular Java is working
    list = java.util.ArrayList.new
    a = "1"
    b = "2"
    list.add(a)
    list.add(b)
    d = list[0]
    return render :text => list
  end
end
Unfortunately for me, I get the error
java.lang.NullPointerException: null
when I include the
lp = LexicalizedParser.new
Am I doing EVERYTHING wrong? When I comment out the lp = ..., I get the list output, so JRuby is working and I can write Java in my Rails app and get the output.
Can somebody point me in the right direction, maybe tell me what is wrong with this bit of code, and hopefully set me straight on how I'm supposed to be working with JRuby and Rails? Some input on the Stanford Parser would be welcome too (I know, it's a lot to ask). There seems to be very little in the way of documentation or example code that I've found.
I don't think so. But I do think that you need to read up on how this parser works.
According to http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/parser/lexparser/LexicalizedParser.html, the default constructor works as follows:
Construct a new LexicalizedParser object from a previously serialized
grammar read from a property
edu.stanford.nlp.SerializedLexicalizedParser, or a default file
location.
In other words, you are getting the NPE because the default constructor can't find enough information to create the parser.
If you grab the binary distribution from Stanford, appropriate grammars can be found in the grammar directory. For example:
$ jruby -S irb
irb(main):001:0> require 'java'
=> true
irb(main):002:0> require 'stanford-parser.jar'
=> true
irb(main):003:0> java_import Java::edu.stanford.nlp.parser.lexparser.LexicalizedParser
=> Java::EduStanfordNlpParserLexparser::LexicalizedParser
irb(main):004:0> lp = LexicalizedParser.new("grammar/englishPCFG.ser.gz")
Loading parser from serialized file grammar/englishPCFG.ser.gz ... done [2.5 sec].
=> #<Java::EduStanfordNlpParserLexparser::LexicalizedParser:0x7d627b8b>
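From there, actually parsing a sentence should look something like the snippet below. I'm going from the javadoc for that era's API, so treat the method names as assumptions and check them against your version:
# apply() takes a raw String and returns an edu.stanford.nlp.trees.Tree
tree = lp.apply("This is a sentence.")
puts tree.to_s # bracketed parse, e.g. (ROOT (S (NP ...) (VP ...)))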

Rails - Paper_Clip - Support for Multi File Uploads

I have paper_clip installed on my Rails 3 app, and can upload a file - wow that was fun and easy!
The challenge now is allowing a user to upload multiple objects, whether that's clicking "select files" and being able to select more than one, or clicking a "more" button and getting another file upload field.
I can't find any tutorials or gems to support this out of the box. Shocking, I know...
Any suggestions or solutions? Seems like a common need.
Thanks
Okay, this is a complex one but it is doable. Here's how I got it to work.
On the client side I used http://github.com/valums/file-uploader, a javascript library which allows multiple file uploads with progress-bar and drag-and-drop support. It's well supported, highly configurable and the basic implementation is simple:
In the view:
<div id='file-uploader'><noscript><p>Please Enable JavaScript to use the file uploader</p></noscript></div>
In the js:
var uploader = new qq.FileUploader({
  element: $('#file-uploader')[0],
  action: 'files/upload',
  onComplete: function(id, fileName, responseJSON){
    // callback
  }
});
When handed files, FileUploader posts them to the server as an XHR request where the POST body is the raw file data, while the headers and filename are passed in the URL string (this is the only way to upload a file asynchronously via javascript).
This is where it gets complicated: since Paperclip has no idea what to do with these raw requests, you have to catch and convert them back to standard files (preferably before they hit your Rails app) so that Paperclip can work its magic. This is done with some Rack middleware which creates a new Tempfile (remember: Heroku is read only):
# Embarrassing note: This code was adapted from an example I found somewhere online
# if you recognize any of it please let me know so I can pass on credit.
module Rack
  class RawFileStubber

    def initialize(app, path=/files\/upload/) # change for your route, careful.
      @app, @path = app, path
    end

    def call(env)
      if env["PATH_INFO"] =~ @path
        convert_and_pass_on(env)
      end
      @app.call(env)
    end

    def convert_and_pass_on(env)
      tempfile = env['rack.input'].to_tempfile
      fake_file = {
        :filename => env['HTTP_X_FILE_NAME'],
        :type     => content_type(env['HTTP_X_FILE_NAME']),
        :tempfile => tempfile
      }
      env['rack.request.form_input'] = env['rack.input']
      env['rack.request.form_hash'] ||= {}
      env['rack.request.query_hash'] ||= {}
      env['rack.request.form_hash']['file']  = fake_file
      env['rack.request.query_hash']['file'] = fake_file
      if query_params = env['HTTP_X_QUERY_PARAMS']
        require 'json'
        params = JSON.parse(query_params)
        env['rack.request.form_hash'].merge!(params)
        env['rack.request.query_hash'].merge!(params)
      end
    end

    def content_type(filename)
      case type = (filename.to_s.match(/\.(\w+)$/)[1] rescue "octet-stream").downcase
      when %r"jp(e|g|eg)"        then "image/jpeg"
      when %r"tiff?"             then "image/tiff"
      when %r"png", "gif", "bmp" then "image/#{type}"
      when "txt"                 then "text/plain"
      when %r"html?"             then "text/html"
      when "js"                  then "application/js"
      when "csv", "xml", "css"   then "text/#{type}"
      else 'application/octet-stream'
      end
    end
  end
end
Later, in application.rb:
config.middleware.use 'Rack::RawFileStubber'
Then in the controller:
def upload
  @foo = ModelWithPaperclip.create({ :img => params[:file] })
end
This works reliably, though it can be a slow process when uploading a lot of files simultaneously.
DISCLAIMER
This was implemented for a project with a single, known & trusted back-end user. It almost certainly has some serious performance implications for a high traffic Heroku app and I have not fire tested it for security. That said, it definitely works.
The method Ryan Bigg recommends is here:
https://github.com/rails3book/ticketee/commit/cd8b466e2ee86733e9b26c6c9015d4b811d88169
https://github.com/rails3book/ticketee/commit/982ddf6241a78a9e6547e16af29086627d9e72d2
The file-uploader recommendation by Daniel Mendel is really great. It's a seriously awesome user experience, like Gmail drag-and-drop uploads. Someone wrote a blog post about how to wire it up with a rails app using the rack-raw-upload middleware, if you're interested in an up-to-date middleware component.
http://pogodan.com/blog/2011/03/28/rails-html5-drag-drop-multi-file-upload
https://github.com/newbamboo/rack-raw-upload
http://marc-bowes.com/2011/08/17/drag-n-drop-upload.html
There's also another plugin that's been updated more recently which may be useful
jQuery-File-Upload
Rails setup instructions
Rails setup instructions for multiples
And another one (Included for completeness. I haven't investigated this one.)
Plupload
plupload-rails3
These questions are highly related
Drag-and-drop file upload in Google Chrome/Chromium and Safari?
jQuery Upload Progress and AJAX file upload
I cover this in Rails 3 in Action's Chapter 8. I don't cover uploading to S3 or resizing images however.
Recommending you buy it based solely on it fixing this one problem may sound a little biased, but I can just about guarantee you that it'll answer other questions you have down the line. It has a Behaviour Driven Development approach as one of the main themes, introducing you to Rails features during the development of an application. This shows you not only how you can build an application, but also how to make it maintainable.
As for the resizing of images after they've been uploaded, Paperclip's got pretty good documentation on that. I'd recommend having a read and then asking another question on SO if you don't understand any of the options / methods.
And as for S3 uploading, you can do this:
has_attached_file :photo, :styles => { ... }, :storage => :s3
You'd need to configure Paperclip::Storage::S3 with your S3 details to set it up, and again Paperclip's got some pretty awesome documentation for this.
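For reference, a fuller declaration might look roughly like the following; the styles, bucket name and credentials path are placeholders, so check the Paperclip docs for the options your version supports:
class User < ActiveRecord::Base
  has_attached_file :photo,
    :styles         => { :thumb => "100x100>", :medium => "300x300>" },
    :storage        => :s3,
    :s3_credentials => "#{Rails.root}/config/s3.yml", # hypothetical credentials file
    :bucket         => "my-app-uploads"               # hypothetical bucket name
end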
Good luck!

Ruby on Rails 3: Streaming data through Rails to client

I am working on a Ruby on Rails app that communicates with RackSpace cloudfiles (similar to Amazon S3 but lacking some features).
Due to the lack of per-object access permissions and query-string authentication, downloads to users have to be mediated through an application.
In Rails 2.3, it looks like you can dynamically build a response as follows:
# Streams about 180 MB of generated data to the browser.
render :text => proc { |response, output|
10_000_000.times do |i|
output.write("This is line #{i}\n")
end
}
(from http://api.rubyonrails.org/classes/ActionController/Base.html#M000464)
Instead of 10_000_000.times... I could dump my cloudfiles stream generation code in there.
Trouble is, this is the output I get when I attempt to use this technique in Rails 3.
#<Proc:0x000000010989a6e8@/Users/jderiksen/lt/lt-uber/site/app/controllers/prospect_uploads_controller.rb:75>
Looks like maybe the proc object's call method is not being called? Any other ideas?
Assign to response_body an object that responds to #each:
class Streamer
  def each
    10_000_000.times do |i|
      yield "This is line #{i}\n"
    end
  end
end
self.response_body = Streamer.new
If you are using 1.9.x or the Backports gem, you can write this more compactly using Enumerator.new:
self.response_body = Enumerator.new do |y|
  10_000_000.times do |i|
    y << "This is line #{i}\n"
  end
end
Note that when and if the data is flushed depends on the Rack handler and underlying server being used. I have confirmed that Mongrel, for instance, will stream the data, but other users have reported that WEBrick, for instance, buffers it until the response is closed. There is no way to force the response to flush.
In Rails 3.0.x, there are several additional gotchas:
In development mode, doing things such as accessing model classes from within the enumeration can be problematic due to bad interactions with class reloading. This is an open bug in Rails 3.0.x.
A bug in the interaction between Rack and Rails causes #each to be called twice for each request. This is another open bug. You can work around it with the following monkey patch:
class Rack::Response
  def close
    @body.close if @body.respond_to?(:close)
  end
end
Both problems are fixed in Rails 3.1, where HTTP streaming is a marquee feature.
Note that the other common suggestion, self.response_body = proc {|response, output| ...}, does work in Rails 3.0.x, but has been deprecated (and will no longer actually stream the data) in 3.1. Assigning an object that responds to #each works in all Rails 3 versions.
Thanks to all the posts above, here is fully working code to stream large CSVs. This code:
Does not require any additional gems.
Uses Model.find_each() so as to not bloat memory with all matching objects.
Has been tested on Rails 3.2.5, Ruby 1.9.3 and Heroku using Unicorn, with a single dyno.
Adds a GC.start every 500 rows, so as not to blow the Heroku dyno's allowed memory.
You may need to adjust the GC.start interval depending on your model's memory footprint. I have successfully used this to stream 105K models into a CSV of 9.7 MB without any problems.
Controller Method:
def csv_export
  respond_to do |format|
    format.csv {
      @filename = "responses-#{Date.today.to_s(:db)}.csv"
      self.response.headers["Content-Type"] ||= 'text/csv'
      self.response.headers["Content-Disposition"] = "attachment; filename=#{@filename}"
      self.response.headers['Last-Modified'] = Time.now.ctime.to_s
      self.response_body = Enumerator.new do |y|
        i = 0
        Model.find_each do |m|
          if i == 0
            y << Model.csv_header.to_csv
          end
          y << m.csv_array.to_csv
          i = i + 1
          GC.start if i % 500 == 0
        end
      end
    }
  end
end
config/unicorn.rb
# Set to 3 instead of 4 as per http://michaelvanrooijen.com/articles/2011/06/01-more-concurrency-on-a-single-heroku-dyno-with-the-new-celadon-cedar-stack/
worker_processes 3
# Change timeout to 120s to allow downloading of large streamed CSVs on slow networks
timeout 120
#Enable streaming
port = ENV["PORT"].to_i
listen port, :tcp_nopush => false
Model.rb
def self.csv_header
  ["ID", "Route", "username"]
end

def csv_array
  [id, route, username]
end
It looks like this isn't available in Rails 3
https://rails.lighthouseapp.com/projects/8994/tickets/2546-render-text-proc
This appeared to work for me in my controller:
self.response_body = proc { |response, output|
  output.write "Hello world"
}
If you are assigning to response_body an object that responds to #each and it's buffering until the response is closed, try this in your action controller:
self.response.headers['Last-Modified'] = Time.now.to_s
Just for the record, Rails >= 3.1 has an easy way to stream data by assigning an object that responds to #each to the controller's response.
Everything is explained here: http://blog.sparqcode.com/2012/02/04/streaming-data-with-rails-3-1-or-3-2/
Yes, response_body is the Rails 3 way of doing this for the moment: https://rails.lighthouseapp.com/projects/8994/tickets/4554-render-text-proc-regression
This solved my problem as well - I have gzip'd CSV files that I want to send to the user as unzipped CSV, so I read them a line at a time using a GzipReader.
These lines are also helpful if you're trying to deliver a big file as a download:
self.response.headers["Content-Type"] = "application/octet-stream"
self.response.headers["Content-Disposition"] = "attachment; filename=#{filename}"
In addition, you will have to set the 'Content-Length' header yourself.
If not, Rack will have to wait (buffering the body data into memory) to determine the length,
and that will ruin your efforts using the methods described above.
In my case, I could determine the length.
In cases where you can't, you need to make Rack start sending the body without a 'Content-Length' header.
Try adding "use Rack::Chunked" into config.ru after the 'require' and before the 'run'. (Thanks arkadiy)
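For example, a config.ru along these lines -- the application constant is a placeholder, so substitute your own:
# config.ru
require ::File.expand_path('../config/environment', __FILE__)

# Chunked transfer encoding lets Rack send the body without a Content-Length header
use Rack::Chunked

run MyApp::Application # substitute your application's constant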
I commented in the lighthouse ticket, just wanted to say the self.response_body = proc approach worked for me though I needed to use Mongrel instead of WEBrick to succeed.
Martin
Applying John's solution along with Exequiel's suggestion worked for me.
The statement
self.response.headers['Last-Modified'] = Time.now.to_s
marks the response as non-cacheable in Rack.
After investigating further, I figured one could also use this:
headers['Cache-Control'] = 'no-cache'
This, to me, is just slightly more intuitive. It conveys the message to anyone else who may be reading my code. Also, in case a future version of Rack stops checking for Last-Modified, a lot of code may break and it may be a while for folks to figure out why.

Rails Binary Stream support

I'm going to be starting a project soon that requires support for large-ish binary files. I'd like to use Ruby on Rails for the webapp, but I'm concerned with the BLOB support. In my experience with other languages, frameworks, and databases, BLOBs are often overlooked and thus have poor, difficult, and/or buggy functionality.
Does RoR support BLOBs adequately? Are there any gotchas that creep up once you're already committed to Rails?
BTW: I want to be using PostgreSQL and/or MySQL as the backend database. Obviously, BLOB support in the underlying database is important. For the moment, I want to avoid focusing on the DB's BLOB capabilities; I'm more interested in how Rails itself reacts. Ideally, Rails should be hiding the details of the database from me, and so I should be able to switch from one to the other. If this is not the case (ie: there's some problem with using Rails with a particular DB) then please do mention it.
UPDATE: Also, I'm not just talking about ActiveRecord here. I'll need to handle binary files on the HTTP side (file upload effectively). That means getting access to the appropriate HTTP headers and streams via Rails. I've updated the question title and description to reflect this.
As for streaming, you can do it all in an (at least memory-) efficient way. On the upload side, file parameters in forms are abstracted as IO objects that you can read from; on the download side, look into the form of render :text => that takes a Proc argument:
render :content_type => 'application/octet-stream', :text => Proc.new { |response, output|
  # do something that reads data and writes it to output
}
If your stuff is in files on disk, though, the aforementioned solutions will certainly work better.
+1 for attachment_fu
I use attachment_fu in one of my apps and MUST store files in the DB (for annoying reasons which are outside the scope of this convo).
The (one?) tricky thing dealing w/BLOB's I've found is that you need a separate code path to send the data to the user -- you can't simply in-line a path on the filesystem like you would if it was a plain-Jane file.
e.g. if you're storing avatar information, you can't simply do:
<%= image_tag @youruser.avatar.path %>
you have to write some wrapper logic and use send_data, e.g. (below is JUST an example w/attachment_fu, in practice you'd need to DRY this up)
send_data(@youruser.avatar.current_data, :type => @youruser.avatar.content_type, :filename => @youruser.avatar.filename, :disposition => 'inline')
Unfortunately, as far as I know attachment_fu (I don't have the latest version) does not do clever wrapping for you -- you've gotta write it yourself.
P.S.
Seeing your question edit - Attachment_fu handles all that annoying stuff that you mention -- about needing to know file paths and all that crap -- EXCEPT the one little issue when storing in the DB. Give it a try; it's the standard for rails apps. IF you insist on re-inventing the wheel, the source code for attachment_fu should document most of the gotchas, too!
You can use the :binary type in your ActiveRecord migration and also constrain the maximum size:
class BlobTest < ActiveRecord::Migration
  def self.up
    create_table :files do |t|
      t.column :file_data, :binary, :limit => 1.megabyte
    end
  end
end
ActiveRecord exposes the BLOB (or CLOB) contents as a Ruby String.
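Reading and writing the column is then just ordinary attribute access. A quick sketch -- the model name and file path are made up (a model for the :files table above would clash with Ruby's File class, so you'd likely pick a different table name in practice):
# Assumes a hypothetical FileRecord model with a binary file_data column like the migration above
record = FileRecord.create(:file_data => File.open("/tmp/example.pdf", "rb") { |f| f.read })
record.file_data.length # => size of the stored blob in bytes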
I think your best bet is the attachment_fu plug-in:
http://github.com/technoweenie/attachment_fu/tree/master
UPDATE: Found some more info here http://groups.google.com/group/rubyonrails-talk/browse_thread/thread/a81beffb93708bb3
Look into the x_send_file plugin too.
"The XSendFile plugin provides a simple interface for sending files via the X-Sendfile HTTP header. This enables your web server to serve the file directly from disk, instead of streaming it through your Rails process. This is faster and saves a lot of memory if you're using Mongrel. Not every web server supports this header. YMMV."
I'm not sure if it's usable with Blobs, it may just be for files on the file system. But you probably need something that doesn't tie up the web server streaming large chunks of data.
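Going by the plugin's README, usage from a controller is roughly as follows; the path, filename and options are illustrative, and the exact signature may differ by version:
def download
  # x_send_file sets the X-Sendfile header so the web server (e.g. Apache with
  # mod_xsendfile) serves the file directly instead of Rails streaming it
  x_send_file("/path/to/file.pdf",
              :type        => "application/pdf",
              :disposition => "attachment",
              :filename    => "file.pdf")
end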
