Several PhantomJS calls in a RoR application - ruby-on-rails

I have a RoR application that, given a set of N URLs to parse, performs N shell calls to a PhantomJS (actually CasperJS) script.
Right now I have something like this:
urls_to_parse = ['first.html', 'second.html',...]
urls_to_parse.each do |url|
  parse_results = `casperjs parse_urls.js '#{url}'`
end
I have never launched shell scripts from a RoR/Ruby application before, so I am wondering whether this is a good approach and what alternatives I have. Why do I use PhantomJS in combination with RoR?
I basically have an API (RoR app) that keeps receiving urls that need to be parsed. They need to be parsed in a headless browser manner. The page actually needs to be rendered (that's why I don't use Nokogiri or any other HTML parser).
I am concerned about putting this into production, performance-wise. Before going forward, I would like to know whether I am doing this correctly or whether there is a better way.

It's possible. I thought about doing the same thing, but even with a headless browser I would be really concerned about the speed and bandwidth your server is going to need. I use Casper in conjunction with Python and it works very well for me. I read the stdout that comes back from running the Casper scripts, but I don't parse and scrape on the fly like you're talking about doing. I would imagine it's okay, but ideally you would already have a cached database of results when people search. Maybe if it is a very basic search you'll be okay, but I don't know.
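On the shell-call part of the question, here is a minimal sketch of one way to make the calls a little safer. It assumes the `casperjs parse_urls.js` command from the question; the `parse_urls` helper and its `command:` parameter are made up for illustration. It uses Open3 from the standard library and passes the URL as a separate argument, so no shell is involved and a URL containing quotes or `;` cannot inject extra commands.

```ruby
require 'open3'

# Hypothetical helper: runs the parser command once per URL and collects
# stdout, stderr and the exit status for each call.
# `command` defaults to the casperjs invocation from the question.
def parse_urls(urls, command: ['casperjs', 'parse_urls.js'])
  urls.map do |url|
    # Passing the arguments separately bypasses the shell entirely.
    stdout, stderr, status = Open3.capture3(*command, url)
    { url: url, ok: status.success?, output: stdout, error: stderr }
  end
end
```

The calls still run one after another and each one starts a full headless browser, so for production it is probably worth moving them out of the request cycle and into a background job.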

Related

Rails based Image Scraper

I'm learning rails and would like to make an image scraper for 4chan. I don't really know where to start though, so I was wondering if anyone could point me towards anything that I could look into to make this happen or anything I could study to become familiar with image scraping.
Well, first of all, you don't need to make a Rails app just for this image-scraping functionality. You could just write a script that does this. For that I would suggest using Nokogiri. You then need to find out how 4chan displays images on its pages (by inspecting a page in your browser) and how they are structured, so that you can get to them.

What tools to use for a website with lots of "realtime" page updates (coming from a Rails background)?

We are planning to make a "large" website for I'd say 5000 up to many more users. We think of putting in lots of real time functionality, where data changes instantly propagate to all connected clients. New frameworks like Meteor and DerbyJS look really promising for this kind of stuff.
Now, I wonder if it is possible to do typical backend stuff like sending (bulk) emails, cleaning up the database, generating pdfs, etc. with those new frameworks. And in a way that is productive and doesn't suck. I also wonder how difficult it is to create complex forms with them. I got used to the convenient Rails view helpers and Ruby gems to handle those kind of things.
Meteor and DerbyJS are both quite new, so I do expect lots of functionality will be added in the near future. However, I also wonder if it might be a good idea to combine those frameworks with a "traditional" Rails app that serves up certain complex pages which do not need realtime updates, and/or with a Rails or Sinatra app that provides an API to do the heavy backend processing. Those Rails apps could then access the same databases as the Meteor/DerbyJS app. Does anyone think this is a good idea? Or rather not? Why?
It would be nice if anyone with sufficient experience with those new "single page app realtime" frameworks could comment on this. Where are they heading towards? Will they be able to handle "complete" web apps with authentication and backend processing? Will it be as productive/convenient to program with them as with Rails? Well, I guess no one can know that for sure yet ;-) Well, any thoughts, guesses and ideas are welcome!
For things like sending bulk emails and generating PDFs, Derby lets you simply use normal Node.js modules. npm now has over 10,000 packages, so there are packages for most things you might want to do on the server. Derby doesn't control your server, and it works on top of any normal Express server. You should probably stick with Node.js code as much as possible and not use Rails along with Derby. That is not to say that you can't send messages to a separate Rails app, but since you already have to have a Node.js app running to host Derby, you might as well use it for stuff like this.
To communicate with such server-side code, you can use Derby's model events. We are still exploring how this kind of code works and we don't have a lot of examples, but it is something that we will have a clear story around. We are building an app ourselves that communicates with an email server, so we should have some real experience with this pretty soon.
You can also just use a normal AJAX request or send a message over Socket.IO manually if you don't want to use the Derby model to do this kind of communication. You are free to make your own server-side only routes with Express along with your Derby app routes. We think it is nice to have this kind of flexibility in case there are any use cases that we didn't properly anticipate with the framework.
As far as creating forms goes, Derby has a very powerful templating system, and I am working on making it a lot better still. We are working on a new UI components feature that will make it possible to build libraries of self-contained UI widgets that can simply be dropped into a Derby app while still playing nicely with automatic view-model bindings and data syncing. Once this feature is completed, I think form component libraries will be written rather quickly.
We do expect to include all of the features needed for a normal app, much like Rails does. It won't look like Rails or work like Rails, but it will be similarly feature complete eventually.
For backend tasks (such as sending emails, cleaning up the database, generating PDFs), it's better to use Resque or Sidekiq.
Now, I wonder if it is possible to do typical backend stuff like
sending (bulk) emails, cleaning up the database, generating pdfs, etc.
with those new frameworks. And in a way that is productive and doesn't
suck. I also wonder how difficult it is to create complex forms with
them. I got used to the convenient Rails view helpers and Ruby gems to
handle those kind of things.
Also, my question is not only about background jobs, but also about stuff one might do during a request, like generating a PDF, or simply rendering complex views with Rails helpers or code from gems.
You're mixing metaphors here - a single page app is just a site where the content is loaded without doing a full page reload, be that a front end in pure js or you could use normal html and pjax.
The kinds of things you are describing would be done in a background task regardless of the front-end framework you used. But +1 for Sidekiq if you're using Ruby.
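To illustrate what "done in a background task" means, here is a toy sketch using only Ruby's built-in Thread and Queue. It is not Sidekiq or Resque (those add persistence, retries and separate worker processes), and the `BackgroundWorker` class is a made-up name. It only shows the idea of handing work to something that runs outside the request.

```ruby
# Toy in-process worker, for illustration only. Real job systems such as
# Sidekiq or Resque store jobs outside the web process.
class BackgroundWorker
  def initialize
    @jobs = Queue.new
    @thread = Thread.new { run }
  end

  # Queue a block of work and return immediately.
  def enqueue(&job)
    @jobs << job
    self
  end

  # Ask the worker to stop after finishing the jobs already queued.
  def shutdown
    @jobs << :stop
    @thread.join
  end

  private

  def run
    loop do
      job = @jobs.pop # blocks until a job is available
      break if job == :stop
      job.call
    end
  end
end
```

A request handler would call something like `worker.enqueue { send_bulk_email }` (where `send_bulk_email` is whatever slow task you have) and respond right away, leaving the slow work to the worker thread.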
As for notifying all the other users of things that have changed, you can look into using http://pusher.com or http://pubnub.com if you don't want to maintain a websocket server.

Triggering FireWatir actions from different ruby scripts on the same browser window

I am writing an application that uses FireWatir to do a bunch of different actions. The problem is that I want to trigger these actions from many separate ruby files.
So, for example, one Ruby script will launch a new Firefox browser instance, then a totally different script will have that instance go to a specific website, and another will log into Gmail.
I want all of these scripts to affect the same browser window. That way I can have one script take me to a specific website, and wait for another script to be triggered to do something else.
Please tell me that this is possible.
Chad,
I think that is possible. I am not sure that it's necessary or efficient, but I know that it's possible. The key is to make sure that you attach to the right browser instance. If you will only have one, that could be much simpler.
If you identify the problem that you are trying to solve with these multiple scripts then maybe one or more of the experienced framework designers can point you to existing solutions to the problem. There are some pretty awesome solutions that exist already. At the end of the day, we face the same issues.
Good luck,
Dave
I ended up getting around this issue by using socketing. Had a ruby script acting as the server that was waiting for requests from another group of ruby scripts that could be triggered whenever.
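A rough sketch of that kind of setup, using only the standard `socket` library, might look like the following. The `CommandServer` class, the `send_command` helper and the one-line text protocol are all assumptions for illustration and not the code I actually used. In the real thing the handler would hold the FireWatir browser object and translate each command into a browser action.

```ruby
require 'socket'

# Hypothetical command server: one long-lived process owns the browser,
# and other scripts send it one-line commands over TCP.
class CommandServer
  attr_reader :port

  # handler: any object responding to #call(command_string) -> reply string.
  # port 0 asks the OS for any free port.
  def initialize(handler, host: '127.0.0.1', port: 0)
    @handler = handler
    @server = TCPServer.new(host, port)
    @port = @server.addr[1]
  end

  # Serve `count` connections (one command line each), then close.
  def serve(count)
    count.times do
      client = @server.accept
      line = client.gets
      client.puts(@handler.call(line.to_s.chomp))
      client.close
    end
  ensure
    @server.close
  end
end

# What each of the separate trigger scripts would do.
def send_command(port, command, host: '127.0.0.1')
  TCPSocket.open(host, port) do |sock|
    sock.puts(command)
    sock.gets.to_s.chomp
  end
end
```

A real server would loop forever instead of serving a fixed number of connections; the count is only there to keep the sketch easy to run and stop.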

Best Practices for Optimizing Dynamic Page Load Times (JSON-generated HTML)

I have a Rails app where I load up a base HTML layout and I fill in the main content with rows of divs from JSON. This works in 2 steps:
1. Render the HTML
2. Ajax call to get the JSON
This has the benefit of being able to cache the HTML layout which doesn't change much, but it seems to have more drawbacks:
- 2 HTTP requests.
- The HTML layout isn't that complex; the generated HTML is where all the work is done, so I'm probably not saving much time.
- Each request in my specific case requires that we check the current user, their roles, and some things related to that user, so those 2 calls are somewhat involved.
Granted, memcached will probably solve a lot of this, I am wondering if there are some best practices here. I'm thinking I could do this:
Render the first page of JSON inline, in a script block, along with the HTML. This would cut out those 2 server calls requiring user authentication. And, assuming 80% of the time you don't need to make the second ajax call (pagination/sorting in this case), that seems like a fairly good solution.
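A minimal sketch of that inline approach, using plain ERB and JSON from the standard library rather than a real Rails view. The layout, the `render_page` helper and the `window.initialRows` variable are all hypothetical names. The one detail worth getting right is escaping `</` so that data containing `</script>` cannot end the script block early.

```ruby
require 'json'
require 'erb'

# Hypothetical layout: the first page of rows is embedded in the page itself.
LAYOUT = ERB.new(<<~HTML)
  <div id="rows"></div>
  <script>
    window.initialRows = <%= rows_json %>;
  </script>
HTML

def render_page(rows)
  # "<\/" is still valid inside a JSON/JavaScript string, but the browser's
  # HTML parser no longer sees a closing script tag in the data.
  rows_json = JSON.generate(rows).gsub('</') { '<\/' }
  LAYOUT.result_with_hash(rows_json: rows_json)
end
```

The client-side code would then build the first page of divs from `window.initialRows` and only fall back to the Ajax call for pagination and sorting.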
What are your thoughts on how to approach this?
There are advantages and disadvantages to doing stuff like this. In general I'd say it's only a good idea, if whatever you're delaying via an ajax call would delay the page load enough to annoy the end user for most of the use cases on your page.
A good example of this is browsing a repository on github. 90% of the time all you want is to navigate the files, so they use an ajax load to fill in the commit messages per file after the page load.
It sounds like you're trying to do this to speed things up or do something fancy for your users, but I think you should instead consider which part is slow, and what page load speed (and maybe for what information on that page) your users are expecting. As you say, using memcached or fragment caching might well give you the improvements you're looking for.
Are you using some kind of monitoring tool? I'm using the free version of New Relic RPM on Heroku. It gives a lot of data on request times for individual controller actions. Data like that could help you focus your optimization process.

Getting Started with Ruby & Ruby on Rails

Some background:
I'm a jack-of-all trades, one of which is programming. I learned VB6 through Excel and PHP for creating websites and so far it's worked out just fine for me. I'm not CS major or even mathematically inclined - logic is what interests me.
Current status:
I'm willing to learn new and more powerful languages; my first foray into such a route is learning Ruby. I went to the main Ruby website and did the interactive intro. (by the way, I'm currently getting redirected to google.com when I try the link...it's happening to other websites as well...is my computer infected?)
I liked what I learned and wanted to get started using Ruby to create websites. I downloaded InstantRails and installed it; everything so far has been fine - the program starts up just fine, and I can test some Ruby code in the console. However my troubles begin when I try and view a web page with Ruby code present.
Lastly, my problem:
As in PHP, I can browse to the .php file directly and through using PHP tags and some simple 'echo' statements I can be on my way in making dynamic web pages. However with the InstantRails app working, accessing a .rb or .rhtml page doesn't produce similar results. I made a simple text file named 'test.rb' and put basic HTML tags in there (html, head, body) and the Ruby tags <%= and %> with some ruby code inside. The web page actually shows the tags and the code - as if it's all just plain HTML. I take it Ruby isn't parsing the page before it is displayed to the user, but this is where my lack of understanding of the Ruby environment stops me short. Where do I go from here?
AMENDMENT: This tutorial has helped me immensely! I'd suggest anyone who's in my position go there.
First of all, you must disconnect the relationship between files and URLs.
Rails uses an MVC approach, which is a world apart from a script-based approach like ASP/PHP.
In classic PHP, you have something like this:
Server> Server started, serving scripts from /usr/jake/example.com/htdocs/
User> Please give me /home.php, thanks!
Server> OK, /home.php is mapped to /usr/jake/example.com/htdocs/home.php
Server> Executing /usr/jake/example.com/htdocs/home.php
Server> OK, it prints out a "Hello World!", send that to the response.
User> Ok, /home.php shows "Hello World!"
However, most MVC frameworks (Rails included) go something like this:
Server> Server started, initializing routing modules routes.rb
User> Please give me /home, thanks!
Server> OK, /home, per the routing module, is handled with action ShowHomepage() in controller FrontPageCtr
Server> Execute FrontPageCtr.ShowHomepage()
Ruby> FrontPageCtr.ShowHomepage() prints "Hello World!"
Server> OK, sending "Hello World!" down the pipes!
User> Ok, /home shows "Hello World!"
As you can see, there is no connection between what the user puts into the address bar and any script files.
In a typical MVC framework, processing a request for any URL goes something like this:
1. Look in the routing module (which in the case of Rails is defined in routes.rb).
2. The routing module then tells the server which "Controller" and "Action" should handle the request.
3. Rails then creates the Controller and invokes the Action function, whatever that might be.
4. The result from the action then gets "rendered", which in this case presumably means rendering the .rhtml file as actual HTML. There are, of course, other kinds of results, e.g. sending the user to another URL.
5. The result is then written out to the response stream and displayed by the user's browser.
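The flow above can be sketched in a few lines of plain Ruby. This is a toy model only, not how Rails is actually implemented: the `ROUTES` table, `FrontPageController` and `handle_request` are made-up names that mirror the example dialogue.

```ruby
# Toy illustration only: Rails' real router and controllers are far richer.
class FrontPageController
  # The "action": produces the response body.
  def show_homepage
    'Hello World!'
  end
end

# Stands in for routes.rb: maps a URL path to a controller and an action.
ROUTES = {
  '/home' => [FrontPageController, :show_homepage]
}.freeze

def handle_request(path)
  route = ROUTES[path]
  # No file lookup happens: a path without a route is simply not found.
  return '404 Not Found' unless route

  controller_class, action = route
  controller_class.new.public_send(action)
end
```

Note that `handle_request('/home')` never touches a file called home anything; the path is just a key into the routing table.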
In short: You must disconnect the notion of scripts and URL first. When you're building MVC websites, they are almost always NOT related in a way that most people understand.
With that in mind, you should be more comfortable learning Rails and the MVC way of life.
I'm not a Rails pro so please correct me if I'm mistaken on any part.
I would suggest buying and working your way through Agile Web Development with Rails, an excellent book and a very practical way to learn both Ruby and Rails. It's available instantly in a variety of electronic formats, plus you can get paper copy if you prefer that.
From what you describe, you have a fundamentally flawed understanding of how Ruby, and Rails in particular, work. I suggest you spend some time with the book, then come back and ask about anything that you get stumped on.
Rails is "parsing the page before it is displayed to the user", if you locate the right file to modify ;-) Those files to be modified are under the following folder(s):
app/views/...
That's the short answer. For a comprehensive one (for a newbie), I highly recommend: http://guides.rubyonrails.org/getting_started.html
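To see why the tags showed up as plain text, here is a small sketch using ERB from Ruby's standard library, which is the kind of templating Rails views use. The view source and both helper names are hypothetical. The `<%= %>` tags only get evaluated when something runs the file through the template engine; a web server that just hands the file back sends them untouched.

```ruby
require 'erb'

# Hypothetical view source, like what might live under app/views/.
VIEW_SOURCE = '<html><body><p>2 + 2 = <%= 2 + 2 %></p></body></html>'

# What happened in the question: the file is sent back as-is.
def serve_as_static_file(source)
  source
end

# What rendering a view does: the embedded Ruby is evaluated first.
def render_view(source)
  ERB.new(source).result
end
```

Here `render_view(VIEW_SOURCE)` produces a page containing `2 + 2 = 4`, while `serve_as_static_file(VIEW_SOURCE)` still contains the raw `<%= 2 + 2 %>` tag.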
Getting started with Ruby on Rails is something that is a little daunting at first, but after you get started it gets a lot easier. After running Ruby on Rails bootcamps for Startup Accelerators, Harvard Business School, in Times Square, Boston, and Pittsburgh, I started http://www.firehoseonline.com. It's a video tutorial to get started, so you should check out that site.
My advice is to learn as much as you can by actually writing the code. Don't get caught up too much in the details and the specifics. If a tutorial gives you some code to write, and some information, and you don't absorb all the information at first, keep going. Afterwards go back to the material, and once you have gone through the whole process of writing your first application a lot of the pieces will fit together.
As far as your question about opening the PHP files directly goes, using the MVC pattern is a little different. You need to set up the controller, the views and the routes before you can start putting code into .rhtml (or now .html.erb) files. Because of this architecture you'll be able to write a lot of awesome, clean code, super fast, but it can be a bit tricky to wrap your head around (if you REALLY want to write code that way you can with other frameworks, but trust us that this way is better). Stick with it!
Keep your coding mojo high!
Aloha,
Ken

Resources