how to run multiple nokogiri screen scrape threads at once - ruby-on-rails

I have a website that requires using Nokogiri on many different websites to extract data. This process is run as a background job using the delayed_job gem. However, it takes around 3-4 seconds per page because it has to pause and wait for each remote site to respond.
I am currently just running them by basically saying
Websites.all.each do |website|
  # screen scrape
end
I would like to execute them in batches rather than one at a time, so that I don't have to wait for a server response from every site (which can take up to 20 seconds on occasion).
What would be the best ruby or rails way to do this?
Thanks for your help in advance.

You might want to check out Typhoeus, which enables you to make parallel HTTP requests.
I found a short blog post here about using it with Nokogiri, but I haven't tried this myself.
Wrapped in a DJ, this should do the trick with little client-side latency.
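If you'd rather avoid a new dependency, the same idea can be sketched with plain Ruby threads. This is only a sketch: `Websites` and `#url` come from the question, and the Nokogiri parsing step is shown as a comment since only the fetching needs to be concurrent.

```ruby
require "net/http"
require "uri"

# Map a block over items concurrently, one thread per item;
# results come back in the original order.
def map_concurrently(items, &block)
  items.map { |item| Thread.new(item, &block) }.map(&:value)
end

# Usage sketch (Websites and #url come from the question; Nokogiri assumed):
# pages = map_concurrently(Websites.all) do |site|
#   html = Net::HTTP.get(URI(site.url))
#   Nokogiri::HTML(html)
# end
```

Because each thread spends nearly all its time blocked on network I/O, this parallelizes well even on MRI with its global interpreter lock.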

You need to use Delayed Job. Check out this Railscast.
Keep in mind most hosts charge for this type of thing.
You can also use the spawn plugin if you don't want to manage threads yourself; it is much, much easier!
This is literally all you need to do:
rails plugin install https://github.com/tra/spawn.git
Then call the spawn method in your controller or model. For example:
spawn do
  # execute your code here :)
end
http://railscasts.com/episodes/171-delayed-job
https://github.com/tra/spawn

I'm using EventMachine to do something similar for a current project. There is a terrific plugin called em-http-request that allows you to make multiple HTTP requests in parallel, as well as providing options for synchronising the responses.
From the em-http-request github docs:
EventMachine.run {
  http1 = EventMachine::HttpRequest.new('http://google.com/').get
  http2 = EventMachine::HttpRequest.new('http://yahoo.com/').get

  http1.callback { }
  http2.callback { }
}
So in your case, you could have:

EventMachine.run do
  pending = Websites.all.map do |website|
    EventMachine::HttpRequest.new(website.url).get
  end
  pending.each do |http|
    http.callback do
      # parse http.response with Nokogiri here
      pending.delete(http)
      EventMachine.stop if pending.empty?
    end
  end
end
Run your rails application with the thin webserver in order to get a functioning EventMachine loop:
bundle exec rails server thin
You'll also need the eventmachine and em-http-request gems. Good luck!

Related

Improve concurrency on a simple AJAX call in rails

I have created a simple ajax call with the following code:
controller.rb
def locations
  sleep 1.2
  some_data = [{ "name" => "chris", "age" => "14" }]
  render json: some_data
end
view.js
function getLocation() {
  $.get('/location').success(function(data){ console.log(data); });
}
$(".button").click(function() { getLocation(); });
routes.rb
get '/location' => 'controller#locations'
Note that the sleep 1.2 in the controller stands in for background jobs or database calls.
The screenshot below is from the devtools Network tab; it shows I have clicked the button 8 times and all the subsequent calls are stalled until the previous one finishes. I think it is due to Rails being single-threaded? Would it be a different case if the server were made with NodeJS? And how can I achieve similar concurrency with Rails for similar AJAX calls?
Thanks!!
Actually, it is not due to Rails, but to the Rails server you are using. Some are single-threaded, and others can be launched as multithreaded.
For instance, if you use Phusion Passenger, you can configure it to run with several threads and so improve concurrency. You should look for Rails "server" comparisons instead of trying to find a solution or a problem in the Rails "framework".
Popular servers are Thin, Unicorn, Puma, and Phusion Passenger. The default development server is called WEBrick.
There are a lot of other stackoverflow questions relating to the differences between servers so I think you should look into them.
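As a concrete illustration, a multithreaded server like Puma configures its thread pool in a small Ruby DSL file. This is a minimal sketch with illustrative values, not a recommended production config:

```ruby
# config/puma.rb -- illustrative values
workers 2        # forked worker processes (each runs a copy of the app)
threads 1, 16    # min and max threads per worker; each thread can serve a request
```

With a pool like this, several slow requests (such as the sleep 1.2 above) can be served concurrently instead of queueing behind one another, provided the app itself is thread-safe.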

em-websocket gem with Ruby on Rails

I started developing a web-socket based game using the em-websocket gem.
To test the application I start the server by running
$> ruby server.rb
and then I just open two browsers going directly to the html file (no web server) and start playing.
But now I want to add a web server, some database tables, an other Ruby on Rails based gems.
How can I achieve communication between my WebSocket server and my Ruby on Rails application? Should they run on the same server as a single process? Or run on separate servers and communicate through AJAX?
I need to support authentication and other features like updating the database when a game is finished, etc.
Thanks in advance.
There is an issue created about this:
https://github.com/igrigorik/em-websocket/issues/21
Here is the deal. I also wanted to develop a WebSocket server/client with the Ruby on Rails framework. However, Ruby on Rails is not very friendly with EventMachine. I struggled with getting a WebSocket client working, so I managed to copy/cut/paste from existing libraries and ended up with the following two essential pieces.
Em-Websocket server
https://gist.github.com/ffaf2a8046b795d94ba0
ROR friendly websocket client
https://gist.github.com/2416740
Put the server code in the script directory, then start it like the following from Ruby code:
# Spawn a new process to run the WebSocket server
pid = Process.spawn("ruby", "web_socket_server.rb",
                    "--loglevel=debug", "--logfile=#{Rails.root}/log/websocket.log",
                    :chdir => "#{Rails.root}/script",
                    :out => "/dev/null", :err => "/dev/null")
Process.detach(pid) # Detach the spawned process
Then your client can be used like this
ws = WebSocketClient.new("ws://127.0.0.1:8099/import")

Thread.new do
  while data = ws.receive
    if data =~ /cancel/
      ws.send("Cancelling..")
      exit
    end
  end
end

# later, when you're done:
ws.close
I wish there were a good Rails-friendly em-websocket client, but I couldn't find one yet.
Once you have the server/client working well, authentication and database support shouldn't be very different from other Rails code. (I mean having the client side apply some auth/db restrictions.)
I am working on a gem that may be helpful with your current use case. The gem is called websocket-rails and has been designed from the ground up to make using WebSockets inside of a Rails application drop dead simple. It is now at a stable release.
Please let me know if you find this helpful or have any thoughts on where it may be lacking.

How to generate a response in a long running synchronous request in Rails?

I'm having trouble figuring out how to do this using Rails, though it is probably cause I don't know the proper term for it.
I basically want to do this:
def my_action
  sleep 1
  # output something in the request, but keep it open
  print '{"progress":15}'
  sleep 3
  # output something else, keep it open
  print '{"progress":65}'
  sleep 1
  # append some more, and close the request
  print '{"success":true}'
end
However I can't figure out how to do this. I basically want to replicate a slow internet connection.
I need to do this because I am scraping websites, which takes time, where I am 'sleeping' above.
Update
I'm reading this using iOS, so I don't want a websocket server, I think.
Maybe this is exactly what you're looking for:
Infinite streaming JSON from Rails 3.1
You probably want to do some reading around HTML5 WebSockets (there are backwards compatible hacks for older browsers) which let you push data to the client from the server.
Rails has a number of ways to implement a WebSockets server. This question gives some of the options Best Ruby on Rails WebSocket tool
If that would work on the server-side, how would you handle it on the client-side?
HTTP requests normally can just have one response (which may be chunked when using streaming, which wouldn't work in your case I think).
I guess you would either have to look into websockets or make separate requests for each step.
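To make the "one response, many chunks" idea concrete: a Rack-style app can return an Enumerator as its body, and each chunk is written to the still-open response as it is yielded. `ProgressApp` and the payloads are illustrative; whether the client actually sees chunks as they arrive depends on the server and any buffering proxies in between.

```ruby
# A Rack-style app whose body is an Enumerator: the server writes
# each chunk to the open response as it is yielded.
class ProgressApp
  def call(env)
    body = Enumerator.new do |out|
      out << '{"progress":15}'
      # ... do a slow step (e.g. scrape one site) here ...
      out << '{"progress":65}'
      out << '{"success":true}'
    end
    [200, { "Content-Type" => "application/json" }, body]
  end
end
```

The Rails 3.1 streaming approach linked above builds on the same mechanism; an iOS client would read the connection incrementally rather than waiting for the full body.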

parallel asynchronous processing with callbacks in rails controller

I am making a rails app and I am wondering whether it is possible to setup an asynchronous/callback architecture in the controller layer. I am trying to do the following:
When a HTTP request is made to /my_app/foo, I want to asynchronously dish out two jobs - a naive ranking job and a complicated ranking job both of which rank 1000 posts - to several worker machines. I want to setup a callback method in the controller for each job which is called when the respective job is finished. If the complicated job does not return within X milliseconds, I want to return the output from the naive job. Otherwise, I want to return the output from the complicated job.
It is important to note that I want these jobs to be performed in parallel. What is the best way to implement such a system in Rails? I am using Apache with Phusion Passenger as my Rails server, if that helps.
Thanks.
Sounds like you should be using background jobs. In that case, when a request comes in, you would start / queue two jobs which would be picked up and processed by a worker, which acts independently of your Rails app.
Here a few links that could be of help:
https://www.ruby-toolbox.com/categories/Background_Jobs
http://railscasts.com/episodes/171-delayed-job
http://railscasts.com/episodes/243-beanstalkd-and-stalker
http://railscasts.com/episodes/271-resque
http://rubyrogues.com/queues-and-background-processing/
It's possible to issue several HTTP requests asynchronously in Rails. However, it's impossible to make Rails itself event-driven.
You can send several HTTP requests asynchronously with libraries such as Typhoeus. However, you might run into concurrency issues if your timeout is too long.
Otherwise, you can try an event-driven web framework such as Cramp or Goliath. They are both based on EventMachine, so you can use em-http-request with them.
Try using RabbitMQ, where you post a message on a queue and expect the response on a reply queue. The queue consumer can even be implemented in Scala for speed. The amqp gem covers what I am describing. A Rails controller with an AMQP binding would be even nicer if possible (I am exploring that option: endpoints with AMQP bindings instead of HTTP). That would solve a good number of problems.
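The timeout-with-fallback part of the question can be sketched with plain Ruby threads. This only shows the racing logic; in a real Passenger app the two rankings would run on worker machines via a queue, not as in-process threads, and `naive`/`complex` here are hypothetical stand-ins for those jobs.

```ruby
# Run both rankings in parallel; wait up to `budget` seconds for the
# complicated one, and fall back to the naive result if it isn't done.
def rank_with_fallback(posts, budget, naive, complex)
  naive_thread   = Thread.new { naive.call(posts) }
  complex_thread = Thread.new { complex.call(posts) }
  if complex_thread.join(budget)  # join(limit) returns the thread only if it finished
    complex_thread.value
  else
    naive_thread.value            # the naive job is assumed to finish quickly
  end
end
```

`Thread#join` with a time limit returns nil on timeout, which is what lets the controller decide which result to render without blocking past its budget.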

Automatically Refreshing Rails Metal In Development Mode

I am trying to develop a Rails Metal endpoint using Sinatra, but it is proving to be a pain because I have to restart the server every time I change the code. I am on JRuby, running from within a larger Java app. Is there an easy way to make this code refresh on every request?
Just because I like abstract abstraction, this is Ryan's code v2:
def every(seconds)
  loop do
    sleep seconds
    yield
  end
end

every(1) { `touch tmp/restart.txt` }
I don't think there is a way to automatically reload sinatra code, however:
If you were running passenger, you could try running in irb:
loop do
  `touch tmp/restart.txt`
  sleep(1)
end
Which will then tell the passenger instance to restart the application.
