I want to copy some specific content from a website using Ruby/Rails.
The content I need is inside a marquee HTML tag, divided into divs.
How can I get access to this content using Ruby?
To be more precise, I want to use some kind of Ruby GUI (preferably Shoes).
How do I do it?
This isn't really a Rails question. It's something you'd do using Ruby, then possibly display using Rails, or Sinatra or Padrino - pick your poison.
There are several different HTTP clients you can use:
Open-URI comes with Ruby and is the easiest. Net::HTTP comes with Ruby and is the standard toolbox, but it's lower-level so you'd have to do more work. HTTPClient and Typhoeus+Hydra are capable of threading and have both high-level and low-level interfaces.
I recommend using Nokogiri to parse the returned HTML. It's very full-featured and robust.
require 'nokogiri'
require 'open-uri'

# URI.open fetches the page over HTTP; Nokogiri parses the HTML into a searchable document.
doc = Nokogiri::HTML(URI.open('http://www.example.com'))
puts doc.to_html
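To grab the content the question asks about, a CSS selector over that parsed document is usually enough. A minimal sketch, assuming the text lives in div tags inside a marquee element (adjust the selector to the real markup):

doc.css('marquee div').each do |div|
  puts div.text.strip   # print each div's text, stripped of surrounding whitespace
end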
If you need to navigate through login screens or fill in forms before you get to the page you need to parse, then I'd recommend looking at Mechanize. It relies on Nokogiri internally so you can ask it for a Nokogiri document and parse away once Mechanize retrieves the desired URL.
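As a rough sketch of that flow (the login URL and field names below are made up, so substitute the real ones):

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://www.example.com/login')   # hypothetical login page
form  = page.form_with(action: /login/)             # pick the login form on that page
form.field_with(name: 'username').value = 'me'      # hypothetical field names
form.field_with(name: 'password').value = 'secret'
result = form.submit
doc    = result.parser                               # the underlying Nokogiri document
puts doc.css('marquee div').map(&:text)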
If you need to deal with dynamic HTML, then look into the various Watir tools. They drive various web browsers and then let you access the content as seen by the browser.
Once you have the content or data you want, you can "repurpose" it into text inside a Rails page.
If I understand correctly, you want a GUI interface to a website scraper. If that's so, you might have to build one yourself.
The easiest way to scrape a website is to use the Nokogiri or Mechanize gems. Basically, you give those libraries the address of the website and then use their XPath capabilities to select the text out of the DOM.
https://github.com/sparklemotion/nokogiri
https://github.com/sparklemotion/mechanize (for the documentation)
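For instance, once Nokogiri has parsed the page, an XPath query pulls the text straight out of the DOM. A small sketch (the URL and XPath expression are only placeholders):

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('http://www.example.com'))
doc.xpath('//marquee/div').each { |node| puts node.text.strip }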
I'm trying to submit input to a form and parse the results in a RoR app. I've tried using Mechanize, but it has some trouble with the way the page dynamically updates the results. It doesn't help that most fields are hidden.
Is there any way to get Mechanize to do what I'm looking for, or are there any alternatives to Mechanize that I can use?
So whenever I want to do something like this, I go with the selenium-webdriver gem. It spawns a real browser (it supports all major brands) and lets you control it with Ruby code. You can do almost everything a real user could do. In addition, you have access to the (rendered) DOM, so JavaScript-generated content is not a problem.
Performance is much slower than with pure library clients, so it's not a good fit for use in a web request cycle.
http://rubygems.org/gems/selenium-webdriver
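A minimal sketch (the URL and element ids below are placeholders):

require 'selenium-webdriver'

driver = Selenium::WebDriver.for :chrome            # spawns a real browser
driver.get 'http://www.example.com/search'          # hypothetical page
driver.find_element(id: 'query').send_keys 'ruby'   # hypothetical input field
driver.find_element(id: 'submit').click             # hypothetical button
puts driver.page_source                             # the rendered DOM, including JS-generated content
driver.quit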
I am writing a script that automates the completion of a web form in my Rails app, using the form entries given on the client side. However, this site uses JavaScript, so Mechanize is out of the question.
Everything I've read about Mechanize's alternatives -- Watir Webdriver, Selenium, Capybara Webkit -- seems to focus exclusively on testing. My Rails web app, though, would take in form entries from users and then enter them, using one of these tools, into another website. For example, I would need to upload an image (i.e. :image) and enter different text (i.e. :city) into form fields as part of this app, which would take the entries and enter them into the website.
So my first question: can I use any Mechanize alternatives for something besides testing? And second: can anyone point to code examples on the web for non-testing usages of any of the above automators?
I don't have any concrete examples of JavaScript-enabled alternatives used in non-testing contexts, but I do have a suggestion: if you know the website that you will be submitting the form info to, it's probably better to find out what the JavaScript is doing and mimic that instead. Dig into the site's JavaScript code and figure out what type of data is being submitted to what URL, and just mimic that using standard HTTP operations -- skip the JavaScript rendering/interaction part altogether.
There is a lot of overhead incurred when rendering a page with JavaScript, which is why these tools (Watir, Selenium, Capybara and the like) are not generally used in actual client-facing application contexts.
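For example, if the browser's developer tools show the page's JavaScript posting form data to some endpoint, you can often reproduce just that request directly (the endpoint and parameters below are made up):

require 'net/http'
require 'uri'

uri = URI('http://www.example.com/api/submit')                               # hypothetical endpoint seen in dev tools
response = Net::HTTP.post_form(uri, 'city' => 'Boston', 'image_id' => '42')  # hypothetical parameters
puts response.body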
Watir can also be run headless; you can give the headless gem a try.
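Roughly, using the headless gem (which wraps Xvfb, so this assumes a Linux machine with Xvfb installed):

require 'headless'
require 'watir-webdriver'

headless = Headless.new
headless.start                            # start a virtual display
browser = Watir::Browser.new :firefox
browser.goto 'http://www.example.com'
puts browser.title
browser.close
headless.destroy                          # tear down the virtual display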
You should be able to use watir-webdriver to take the data (image, city) from one site and upload it to the other site. Below is a brief code sample to help you get started.
require 'watir-webdriver'

$browser1 = Watir::Browser.new :chrome   # you can use phantomjs for headless: http://phantomjs.org/
$browser1.goto 'http://website1.com'
city_field = $browser1.text_field(:id => 'city')
city = city_field.value

$browser2 = Watir::Browser.new :chrome
$browser2.goto 'http://website2.com'
city_field_site2 = $browser2.text_field(:id => 'city')
city_field_site2.set city
I am working on a RoR application where I need to implement a crawler that crawls other sites and stores data in my database. For example, suppose I want to crawl all the deals from http://www.snapdeal.com and store them in my database. How can I implement this with a crawler?
There are a couple of options depending upon your use case.
Nokogiri. Here is the RailsCast that will get you started.
Mechanize is built on top of Nokogiri. See the Mechanize RailsCast.
Screen Scraping with ScrAPI and the ScrAPI RailsCast.
Hpricot.
I have used a combination of Nokogiri and Mechanize for a few of my projects and I think they are good options.
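As a rough sketch of the crawl-and-store loop with Mechanize (the listing URL, CSS classes, and Deal model below are hypothetical; inspect the target site's markup for the real ones):

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://www.snapdeal.com/deals')   # hypothetical listing URL
page.search('.deal').each do |deal|                  # hypothetical CSS class for one deal
  title = deal.at('.title')&.text.to_s.strip         # hypothetical selectors
  price = deal.at('.price')&.text.to_s.strip
  Deal.create!(title: title, price: price)           # assumes a Deal model in your Rails app
end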
You want to take a look at Mechanize. Also, from what you mention, you probably don't need Rails at all.
As Sergio commented, you retrieve pages, parse them, and follow their links. In your case, it sounds like you're more focused on "screen scraping" than on crawling deep link networks, so a library like Scrubyt would be helpful (although development on it has died out). You can also use a lower-level, parsing-focused library like Nokogiri.
I'm working on a Rails project which will need to interface with multiple third-party APIs. I'm pretty new to Rails, and I've never done this before, so I'm lacking some basic information here. Specifically: what is the preferred Rails way of simply querying an external URL?
In the PHP world, it was cURL. You take whatever the resource URL is, throw cURL at it, and start processing the response, whether it be XML, JSON, etc.
So, what's the cURL equivalent in Rails? While we're at it, what is the preferred method of parsing XML and JSON responses? My instincts are to Google around for some Ruby gems to get the job done, but this is such a practical problem that I wouldn't be surprised if the Rails community had already worked out a tried-and-true solution to this kind of problem.
If it's of any contextual value, I plan to run these third-party API interactions as nightly cronjobs, probably all packaged up as custom rake tasks.
Thanks for sharing your expertise.
In a perfect world, a gem already exists for the API you want to use, and you would just use that. Otherwise, you have a few options:
ActiveResource might make sense for you depending on the complexity of the API you want to use. There's an old (and no longer functional) example of using ActiveResource to connect to the Twitter API, for instance.
Net::HTTP is lower-level, but certainly does the trick.
open-uri is a wrapper for net/http
Curb uses libcurl to get things done
Parsing JSON is generally very straightforward. For XML, as stated in another answer, Nokogiri is probably the way to go.
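For a plain fetch-and-parse of a JSON endpoint, something like this is enough (the URL and key name below are placeholders):

require 'net/http'
require 'json'
require 'uri'

uri   = URI('https://api.example.com/items.json')   # hypothetical API endpoint
body  = Net::HTTP.get(uri)                           # returns the response body as a String
items = JSON.parse(body)
items.each { |item| puts item['name'] }              # assumes each item has a "name" key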
For opening URLs you can use open-uri.
Just:
require 'open-uri'
file_handle = URI.open("http://google.com/blah.xml")
To parse XML you can use Nokogiri:
$ gem install nokogiri
document = Nokogiri::XML(file_handle)
document.xpath("//some/xpath")   # search the document with an XPath expression
Nokogiri is a very powerful library; it can do all kinds of searching and modifying for both XML and HTML.
The same goes for HTML: use Nokogiri::HTML.
There is lots of JSON support out there too.
Check out Nokogiri; Hpricot is also good for XML/HTML.
For JSON in Rails:
parsed_json = ActiveSupport::JSON.decode(your_json_string)

# assumes "results" is a hash mapping each long URL to a hash containing the shortened URL
parsed_json["results"].each do |longUrl, convertedUrl|
  site = Site.find_by_long_url(longUrl)
  site.short_url = convertedUrl["shortUrl"]
  site.save
end
see this question:
How do I parse JSON with Ruby on Rails?
I am writing a site with more than one language. Is there any easy way or technique to reduce the workload of switching the text to another language? I have an idea, but I don't know whether it is suitable or easy enough: I create an XML file that contains all the text on my site, and when the user changes their language, my program looks up the language the user chose, gets the suitable tags from the XML, and fills in the page. What do you think? Or is there an easier way to do it?
(Assume I am using RoR if you suggest any gems.)
Check out Rails Internationalization (I18n) API:
The Ruby I18n (shorthand for internationalization) gem which is shipped with Ruby on Rails (starting from Rails 2.2) provides an easy-to-use and extensible framework for translating your application to a single custom language other than English or for providing multi-language support in your application.
I wrote a blog post about "Scoping By Locales". It suggests putting all your content in the database; when you want to fetch it, use the I18n API to set the user's locale to their language, and the fetcher will default to that language.
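As a tiny sketch of the I18n API itself (the keys and locales are just examples; in a Rails app the translations would normally live in config/locales/*.yml rather than being stored by hand):

require 'i18n'

I18n.backend.store_translations(:en, greeting: 'Hello')     # usually loaded from YAML locale files
I18n.backend.store_translations(:fr, greeting: 'Bonjour')
I18n.available_locales = [:en, :fr]

I18n.locale = :fr              # e.g. set from the user's language choice
puts I18n.t(:greeting)         # => "Bonjour"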