How to consume a Twitter/DataSift stream with Rails on Heroku

How would one consume a streaming API (like the Twitter streaming API) with Rails on Heroku? Would it involve keeping a script running with a worker that consumes the stream? If there are any existing resources that document this, please link them; I have not been able to find much so far.

Your two options are to use a worker dyno to run a script that consumes the stream and writes it to a data store (your database etc.), or to fetch parts of the stream on the fly in your rails application as part of your response to HTTP requests.
Which one of those makes sense for you depends on what you are trying to do with the data and how much of the stream you need.

Sorry for the soft answer; none of this code or these ideas are my own...
The easiest way to consume a streaming API on Heroku without using a background worker is to use EventMachine.
In a model, you'd do something like this:
  require 'eventmachine'
  require 'em-http'   # from the em-http-request gem
  require 'json'

  EM.schedule do
    http = EM::HttpRequest.new(STREAMING_URL).get :head => { 'Authorization' => ['USERNAME', 'PASSWORD'] }
    buffer = ""
    http.stream do |chunk|
      # Chunks arrive at arbitrary boundaries; buffer until a full
      # newline-terminated JSON line is available, then parse it.
      buffer << chunk
      while line = buffer.slice!(/.+\r?\n/)
        handle_tweet JSON.parse(line)
      end
    end
  end
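Note that handle_tweet is left for you to define. A minimal sketch (the Tweet model and its text column are assumptions, not part of the original answer) might be:

  # Hypothetical handler: persist each decoded status.
  def handle_tweet(status)
    Tweet.create!(:text => status['text'])
  rescue StandardError => e
    Rails.logger.warn "Skipping tweet: #{e.message}"
  end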
For more details, have a look at Adam Wiggins, Joslyn Esser and Kenn Ejima.

Related

Rails avoid blocking worker in slow controller

Generally any DB/file IO, and even external HTTP requests, are pretty quick, but I am finding that slower ones can hold up all my workers (and memory limits how many Ruby instances I can run), and creating large numbers of threads per worker has other issues (CPU- or memory-heavy actions clogging up the system).
Can I have Rails process these actions in an async manner (more like NodeJS), or else introduce threads for just that action in some way?
Since I want to respond to the original request, neither background workers nor just spawning another thread myself seems appropriate, since Rails will ensure the original thread sends a response when it returns from the controller.
  # What I have today: sequential, blocking calls.
  def my_action
    @data1 = get_data("https://slow.com/data") # e.g. Net::HTTP
    @data2 = get_data("https://slow.com/data2?group_id=#{@data1["id"]}")
    render
  end

  # What I'd like: promise-style chaining (e.g. an internal thread, not sure on other options).
  def my_action
    get_data("https://slow.com/data").then do |data1|
      get_data("https://slow.com/data2?group_id=#{data1["id"]}").then do |data2|
        @data1 = data1
        @data2 = data2
        render # Appears to have no effect
      end
    end
    # Rails does an implicit "render" on return
  end

  # Or: an explicit thread just for this request.
  def my_action
    Thread.new do
      @data1 = get_data("https://slow.com/data")
      @data2 = get_data("https://slow.com/data2?group_id=#{@data1["id"]}")
      render
    end
  end
In a Rails application, you're better off relying on an external process to run background jobs rather than using Ruby Threads.
Sidekiq is a pretty standard gem now for this purpose.
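For instance, a minimal Sidekiq worker might look like this (the class name, URL, and cache key are illustrative, not from the question):

  # app/workers/slow_fetch_worker.rb -- a rough sketch, assuming Sidekiq
  # is installed and Redis is running.
  require 'net/http'

  class SlowFetchWorker
    include Sidekiq::Worker

    def perform(group_id)
      data = Net::HTTP.get(URI("https://slow.com/data2?group_id=#{group_id}"))
      Rails.cache.write("slow_fetch/#{group_id}", data) # stash for later pickup
    end
  end

  # In the controller: enqueue and return immediately.
  SlowFetchWorker.perform_async(42)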
If it takes 10 seconds to process a request, and you want to send your response to the original HTTP request, then you've got to hold open that HTTP connection for 10 seconds. You can't get around that. If your server can handle X HTTP connections, and you have X+1 people making these slow requests... someone is going to get blocked.
There are only three possible solutions:
Figure out a way to process the requests faster. This is ideal, if you can do it.
Don't hold open the HTTP connection. Run a background task (using Sidekiq or a similar gem) to do the work. When it's done, send it via websocket, or have the client poll for it (see the sketch after this list). It makes your API more complicated for the client, but as a client I'd rather deal with a little complexity than have my requests blocked and maybe time out.
Scale up your server until it can handle the traffic. This is the "throw money at the problem" solution. I generally disapprove of this, since you'll have to keep throwing more money every time demand grows. But if your organization has more money than dev time, it might work for a while.
Those are your options.
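As a rough illustration of option 2, reusing the hypothetical worker from the previous answer, the server side of a polling flow might look like this (a client-side poller would hit the show action until it reports done):

  # Kick off the job and hand the client an id to poll with.
  def create
    SlowFetchWorker.perform_async(params[:group_id])
    render :json => { :status => "queued", :group_id => params[:group_id] }
  end

  # Polling endpoint: returns pending until the worker has written the data.
  def show
    data = Rails.cache.read("slow_fetch/#{params[:group_id]}")
    if data
      render :json => { :status => "done", :data => data }
    else
      render :json => { :status => "pending" }
    end
  end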

Rails: Read the contents from another site

In Rails, how can I make an http request to a page, like "http://google.com" and set the response to a variable?
Basically I'm trying to get the contents of a CSV file off of Amazon S3:
https://s3.amazonaws.com/datasets.graf.ly/24.csv
My Rails server needs to return that content as a response to an AJAX request.
Get S3 bucket
Access the file and read it
Render its contents (so the ajax request receives it)
A few questions have suggested screen scraping, but this sounds like overkill (and probably slow) for simply taking a response and pretty much just passing it along.
API
Firstly, you need to know how you're accessing the data.
The problems you've cited are only valid if you just access someone's site through HTTP (with something like cURL). As you instinctively know, this is highly inefficient and will likely get your IP blocked for continuous access.
A far better way to access data (from any reputable service) is to use their API. This is as true of S3 as of Twitter, Facebook, Dropbox, etc.:
AWS-SDK
  # Gemfile
  gem "aws-sdk-core", "~> 2.0.0.rc2"

  # config/application.rb
  Aws.config = {
    access_key_id: '...',
    secret_access_key: '...',
    region: 'us-west-2'
  }

  # config/initializers/s3.rb
  S3 = Aws::S3.new # or, equivalently, in the rc SDK:
  S3 = Aws.s3
Then you'll be able to use the API resources to help retrieve objects:
  # controller
  # yields once per response, even works with non-paged requests
  S3.list_objects(bucket: 'aws-sdk').each do |resp|
    puts resp.contents.map(&:key)
  end
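The question actually wants the file body rather than a listing; a sketch of that under the same SDK (the bucket and key are taken from the question's URL, and the action name is illustrative) could be:

  # Fetch the CSV body and hand it straight back to the AJAX caller.
  def get_csv
    resp = S3.get_object(bucket: 'datasets.graf.ly', key: "#{params[:id].to_i}.csv")
    render :text => resp.body.read, :content_type => 'text/csv'
  end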
CORS
If you were thinking of XHRing into a server, you need to ensure you have the correct CORS permissions to do so.
Considering you're wanting to use S3, I would look at this documentation to ensure you set the permissions correctly. This does not apply to the API or a plain HTTP request (only Ajax).
To do as you asked, combine:
the open-uri solution from How make a HTTP request using Ruby on Rails? (to read from https in the simplest way possible),
the set-headers solution from in rails, how to return records as a csv file, and
a jQuery library to decode the CSV, e.g. http://code.google.com/p/jquery-csv/
Alternatively, decode the CSV file in Rails and pass back a JSON array of arrays:
decode the CSV as suggested in Rails upload CSV file with header
return the decoded data with the appropriate content type
Off the top of my head it should be something like:
  require 'open-uri'

  def get_csv
    url = 'http://s3.amazonaws.com/datasets.graf.ly/%d.csv' % params[:id].to_i
    data = open(url).read
    # set the headers here (see the linked set-headers answer)
    render :text => data
  end

Streaming Download while File is Created

I was wondering if anyone knows how to stream a file download while it's being created at the same time.
I'm generating a huge CSV export and as of right now it takes a couple of minutes for the file to be created. Once it's created, the browser then downloads the file.
I want to change this so that the browser starts downloading the file while it's being created. Looking at a progress bar, users will be more willing to wait: even though it would tell me there is an “Unknown time remaining”, I'm less likely to get impatient since I know data is being steadily downloaded.
NOTE: I'm using Rails version 3.0.9
Here is my code:
  def users_export
    File.new("users_export.csv", "w") # creates a new file to write to
    @todays_date = Time.now.strftime("%m-%d-%Y")
    @outfile = @todays_date + ".csv"
    @users = User.select('id, login, email, last_login, created_at, updated_at')
    FasterCSV.open("users_export.csv", "w+") do |csv|
      csv << [ @todays_date ]
      csv << [ "id", "login", "email", "last_login", "created_at", "updated_at" ]
      @users.find_each(:batch_size => 100) do |u|
        csv << [ u.id, u.login, u.email, u.last_login, u.created_at, u.updated_at ]
      end
    end
    send_file "users_export.csv",
              :type => 'text/csv; charset=iso-8859-1; header=present',
              :disposition => "attachment; filename=#{@outfile}",
              :stream => true
  end
I sought an answer to this question several weeks ago. I thought that if data was being streamed back to the client then maybe Heroku wouldn't time out one of my long running API calls after 30 seconds. I even found an answer that looked promising:
  format.xml do
    self.response_body = lambda { |response, output|
      output.write("<?xml version='1.0' encoding='UTF-8' ?>")
      output.write("<results type='array' count='#{@report.count}'>")
      @report.each do |result|
        output.write(<<-XML)
          <result>
            <element-1>Data-1</element-1>
            <element-2>Data-2</element-2>
            <element-n>Data-N</element-n>
          </result>
        XML
      end
      output.write("</results>")
    }
  end
The idea is that the response_body lambda has direct access to the output buffer going back to the client. However, in practice Rack has its own ideas about what data should be sent back and when. Furthermore, this response_body-as-lambda pattern is deprecated in newer versions of Rails, and I think support is dropped outright in 3.2. You could get your hands dirty in the middleware stack and write this output as Rails Metal, but...
If I may be so bold, I strongly suggest refactoring this work to a background job. The benefits are many:
Your users will not have to just sit and wait for the download. They can request a file and then browse away to other more exciting portions of your site.
The file generation and download will be more robust. For example, if a user loses internet connectivity, even briefly, on minute three of a download under the current setup, they will lose all that time and need to start over. If the file is generated in the background on your site, they only need connectivity for as long as it takes to get the job started.
It will decrease the load on your front-end processes and may decrease the load on your site in total if the background job generates the files and you provide links to the generated files on a page within your app. Chances are one file generation could serve several downloads.
Since practically all Rails web servers are single threaded and synchronous out of the box, you will have an entire app server process tied up on this one file download for each time a user requests it. This makes it easy for users to accidentally carry out a DoS attack on your site.
You can ship the background generated file to a CDN such as S3 and perhaps gain a performance boost on the download speed your users see.
When the background process is done you can notify the user via email so they don't even have to be at the computer where they initiated the file generation in order to know it's done.
Once you have a background job system in your application you will find many more uses for it, such as sending email or updating search indexing.
Sorry that this doesn't really answer your original question. But I strongly believe this is a better overall solution.
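For what it's worth, the refactor might be sketched like this with delayed_job, which fits the Rails 3.0.9 era (the job class, mailer, and file path are all hypothetical):

  # A sketch only: the CSV is written out of band, then the user is notified.
  # The struct members are serialized into the jobs table along with the job.
  class UsersExportJob < Struct.new(:user_id)
    def perform
      path = "tmp/users_export_#{user_id}.csv"
      FasterCSV.open(path, "w") do |csv|
        csv << ["id", "login", "email", "last_login", "created_at", "updated_at"]
        User.find_each(:batch_size => 100) do |u|
          csv << [u.id, u.login, u.email, u.last_login, u.created_at, u.updated_at]
        end
      end
      ExportMailer.ready(user_id, path).deliver # hypothetical mailer
    end
  end

  # Enqueue from the controller:
  Delayed::Job.enqueue UsersExportJob.new(current_user.id)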

How to do parallel HTTP requests in Heroku?

I'm building a Ruby on Rails app that accesses about 6-7 APIs, grabs information from them based on the user's input, compares it, and displays the results to the user (the information is not saved in the database). I will be using Heroku to deploy the app. I would like the HTTP requests to the APIs to be made in parallel, so the response time is better than making them sequentially. What do you think is the best way to achieve this on Heroku?
Thank you very much for any suggestions!
If you want to actually do the requests on the server side (tfe's javascript solution is a good idea), your best bet would be using EventMachine. Using EventMachine gives a simple way to do non-blocking IO.
Also check out EM-Synchrony for a set of Ruby 1.9 fiber aware clients (including HTTP).
All you need to do for a non-blocking HTTP request is something like:
require "em-synchrony"
require "em-synchrony/em-http"
EM.synchrony do
concurrency = 2
urls = ['http://url.1.com', 'http://url2.com']
# iterator will execute async blocks until completion, .each, .inject also work!
results = EM::Synchrony::Iterator.new(urls, concurrency).map do |url, iter|
# fire async requests, on completion advance the iterator
http = EventMachine::HttpRequest.new(url).aget
http.callback { iter.return(http) }
http.errback { iter.return(http) }
end
p results # all completed requests
EventMachine.stop
end
Good luck!
You could always make the requests client-side using Javascript. Then not only can you run them in parallel, but you won't even need the round-trip to your own server.
I haven't tried parallelizing requests like that, but I've tried the parallel gem on Heroku and it works like a charm! This is my simple blog post about it:
http://olemortenamundsen.wordpress.com/2010/10/17/spawning-multiple-threads-at-heroku-using-parallel/
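For reference, the parallel gem's thread mode looks roughly like this (the URLs and thread count are illustrative):

  require 'parallel'
  require 'net/http'

  urls = ['http://api1.example.com', 'http://api2.example.com']

  # Threads suit IO-bound work like HTTP calls; results come back
  # in the same order as the input array.
  results = Parallel.map(urls, :in_threads => urls.size) do |url|
    Net::HTTP.get(URI.parse(url))
  end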
Have a look at creating each request as a background job:
http://blog.heroku.com/archives/2009/7/15/background_jobs_with_dj_on_heroku/
The more 'Workers' you buy from Heroku, the more background jobs can be processed concurrently, leaving your 'Dynos' to serve your users.
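As a sketch, with delayed_job each API call could become its own job (ApiFetcher.fetch is a hypothetical method of your app):

  # .delay proxies the method call into the jobs table, so each fetch
  # runs on a worker instead of inside the web request.
  ApiFetcher.delay.fetch("http://api1.example.com")
  ApiFetcher.delay.fetch("http://api2.example.com")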

My web site need to read a slow web site, how to improve the performance

I'm writing a web site with Rails which lets visitors enter some domains and check whether they have been registered.
When a user clicks the "Submit" button, my web site will post some data to another web site and read the result back. But that web site is slow for me; each request needs 2 or 3 seconds. So I'm worried about the performance.
For example, if my web server allows at most 100 processes, then only 30 or 40 users can visit my web site at the same time. This is not acceptable; is there any way to improve the performance?
PS:
At first I wanted to read that web site with AJAX, but because of the "cross-domain" problem it doesn't work. So I have to use this "AJAX proxy" solution.
It's a bit more work, but you can use something like DelayedJob to process the requests to the other site in the background.
DelayedJob creates separate worker processes that look at a jobs table for stuff to do. When the user clicks submit, such a job is created, and starts running in one of those workers. This off-loads your Rails workers, and keeps your website snappy.
However, you will have to create some sort of polling mechanism in the browser while the job is running, perhaps using a refresh or some simple AJAX. That way, the visitor sees a message such as “One moment, please...”, and after a while, the actual results.
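A minimal sketch of that setup with delayed_job (the job class, whois_lookup call, and column names are all hypothetical):

  # The struct members are serialized into the jobs table with the job.
  class DomainCheckJob < Struct.new(:domain, :check_id)
    def perform
      result = whois_lookup(domain) # the slow call to the other web site
      DomainCheck.update(check_id, :registered => result)
    end
  end

  # On submit: enqueue, then let the browser poll for the row's status.
  Delayed::Job.enqueue DomainCheckJob.new(params[:domain], check.id)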
Rather than posting data to the web site, you could use an HTTP HEAD request, which (I believe) should return only the header information for that URL.
I found this code by googling around a bit:
require "net/http"
req = Net::HTTP.new('google.com', 80)
p req.request_head('/')
This will probably be faster than a POST request, and you won't have to wait to receive the entire contents of that resource. You should be able to determine whether the site is in use based on the response code.
Try using Typhoeus rather than AJAX to get the body. You can POST the domain names to the site to check using Typhoeus and parse the response it fetches. It's extremely fast compared to other solutions. A snippet that I ripped from the wiki page of the GitHub repo http://github.com/pauldix/typhoeus shows that you can run requests in parallel (which is probably what you want, considering that it takes 1 to 2 seconds for an AJAX request!!):
  require 'typhoeus'
  require 'json'

  hydra = Typhoeus::Hydra.new

  first_request = Typhoeus::Request.new("http://localhost:3000/posts/1.json")
  first_request.on_complete do |response|
    post = JSON.parse(response.body)
    # get the first url in the post (JSON.parse returns a Hash)
    third_request = Typhoeus::Request.new(post["links"].first)
    third_request.on_complete do |response|
      # do something with that
    end
    hydra.queue third_request
    post # the block's last value becomes the handled_response
  end

  second_request = Typhoeus::Request.new("http://localhost:3000/users/1.json")
  second_request.on_complete do |response|
    JSON.parse(response.body)
  end

  hydra.queue first_request
  hydra.queue second_request
  hydra.run # this is a blocking call that returns once all requests are complete

  first_request.handled_response  # the value returned from the on_complete block
  second_request.handled_response # the value returned from the on_complete block (parsed JSON)
Also Typhoeus + delayed_job = AWESOME!
