I have a Rails 3.2.13 site that needs to scrape another website to get a product description. What is the best way to do this in Rails 3?
I've heard that Nokogiri is fast. Should I use it? And if I do, can I avoid saving the scraped data entirely? I imagine it working just like getting JSON data from an API; is it like that?
I'd recommend a combination of Nokogiri and open-uri. Require both gems, then do something along the lines of doc = Nokogiri::HTML(open(YOUR_URL)). Next, find the element you want to capture, using the developer tools in Chrome (or the equivalent) or a tool like SelectorGadget. Then use doc.at_css(SELECTOR) for a single element, or doc.search(SELECTOR) for multiple matches. Calling the text method on the result should get you the product description you're looking for. No need to save anything to the database (unless you want to). Hope that helps!
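Putting that together, a minimal sketch (the URL and selector here are placeholders; substitute your product page and the selector you found with the developer tools):

require 'nokogiri'
require 'open-uri'

# Placeholder URL and CSS selector -- adjust for the real page
doc = Nokogiri::HTML(open('http://example.com/products/123'))
description = doc.at_css('div.product-description').text.strip
# `description` now holds the scraped text in memory, just like a parsed API response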
Mechanize is a wonderful gem for scraping data from other websites' HTML. It is simple and robust, and it uses the Nokogiri gem as its result wrapper.
The following snippet shows how you can fetch the data you need from url (your target URL) while identifying as the Safari browser:
require 'htmlentities' # optional, for decoding HTML entities later
require 'mechanize'

agent = Mechanize.new { |a| a.user_agent_alias = 'Mac Safari' }

@result_hash = {}
agent.get(url) do |page|
  parsed_page = page.parser # a Nokogiri::HTML::Document
  @result_hash[:some_data_name] =
    parsed_page.at_xpath("//h1[@class='any-class']").text.split(/\s+/).join(' ')
end
Ideally I want to access this website's API, but since I am struggling to do that, I have decided to try and scrape the page instead. I am starting at this page:
https://fantasy.sixnationsrugby.com/#/welcome/login
Where I plan to log in and then scrape the data.
The code I have below seems to work for every other website I test with, apart from this one. I can't seem to pull anything: no text, no forms, literally nothing works. As an example, I just want to scrape the main header title 'Let's Go! Log in to your account'.
def scrape
  require 'mechanize'

  agent = Mechanize.new
  page = agent.get('https://fantasy.sixnationsrugby.com/#/welcome/login')
  header_title = page.search('div.fs-box-header-title').text.strip
  @output = header_title
end
Is it something to do with how the page is rendered? Thanks
I need to read the GBP rate from this JavaScript file: http://cdn.shopify.com/s/javascripts/currencies.js. I want to get the JS variable as JSON so that I can easily access the value I need by its key. I tried a couple of ways, as follows, with no success in the end.
Way 1
Source: https://docs.ruby-lang.org/en/2.0.0/Net/HTTP.html
My code:
uri = URI('http://cdn.shopify.com/s/javascripts/currencies.js')
@response = Net::HTTP.get(uri) # => String
Result: I get the result as a string, and extracting the GBP rate from that string is difficult and probably not the correct way.
Way 2
Source: curl request in ruby
My code:
url = 'http://cdn.shopify.com/s/javascripts/currencies.js'
mykey = 'demo'
uri = URI(url)
request = Net::HTTP::Get.new(uri.path)
request['Content-Type'] = 'application/xml'
request['Accept'] = 'application/xml'
request['X-OFFERSDB-API-KEY'] = mykey
@response = Net::HTTP.new(uri.host, uri.port) do |http|
  http.request(request)
end
Result: This returns me #<Net::HTTP:0x007f2480874050>, which looks like a memory address, definitely not what I want.
In addition, I've included require 'net/http' and require 'json' in my controller in both cases.
I am very new to Ruby and don't know how to figure this out, so I'm looking for someone who can help.
This is a bit of a weird request, IMO, but Rails can do it. Rails comes with a library called ExecJS (pulled in automatically via the asset pipeline), which lets you run JavaScript from Ruby. So you have some JavaScript you want to run in that file, and you want to return a specific key from it; something like this should do it:
# Expanding upon 'Way 1', which got you the javascript as a string
uri = URI('http://cdn.shopify.com/s/javascripts/currencies.js')
response = Net::HTTP.get(uri)
gbp_rate = ExecJS.exec "#{response}; return Currency.rates.GBP;"
p gbp_rate # => 1.40045
I just want to reiterate this warning (from the FAQ in their README), though:
Can ExecJS be used to sandbox scripts?
No, ExecJS shouldn't be used for any security related sandboxing. Since runtimes are automatically detected, each runtime has different sandboxing properties. You shouldn't use ExecJS.eval on any inputs you wouldn't feel comfortable Ruby eval()ing.
This file looks safe, but just keep it in mind, you are actually executing this javascript.
Personally, I would look to see if there's an API somewhere that can give you this value more easily, or, if the rate doesn't change often (I have never used Shopify, so I don't know how much this changes), just hardcode it in the app as a config value and update it manually. That just feels cleaner to me.
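For example, a hardcoded value could live in an initializer (a hypothetical file; the rate is just the sample value from above):

# config/initializers/exchange_rates.rb
# Hypothetical hardcoded rate -- update manually when it changes
GBP_RATE = 1.40045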
I am currently working on a Rails app.
I want to go to a website (http://alt19.com/) and select a set of options, then click a button which triggers the download of a CSV file. Then I want to take the file and parse it.
I have found a gem for parsing CSV files.
However, I don't know if there is a gem for navigating to another website, selecting a set of options, downloading several files and saving them somewhere where my app can process them.
Is there anything like this?
If not, are there any alternative solutions?
You can use the mechanize gem to scrape the page. Mechanize uses Nokogiri as one of its dependencies, which is responsible for the parsing, and Mechanize adds the ability to fill in forms and click elements on the page.
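As a rough sketch of the whole flow (the form, field name, and button label below are placeholders, not taken from alt19.com; inspect the real page to find them):

require 'mechanize'
require 'csv'

agent = Mechanize.new
page = agent.get('http://alt19.com/')

# Placeholder form and field names -- inspect the real page for the actual ones
form = page.forms.first
form['report_type'] = 'daily'
result = form.submit(form.button_with(value: 'Download'))

# Mechanize wraps non-HTML responses in a file-like object you can save
result.save('tmp/report.csv')

# Parse the saved file with Ruby's standard-library CSV
CSV.foreach('tmp/report.csv', headers: true) do |row|
  # process each row here
end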
As you can see, the CSV generator form makes a POST with some params.
Just do the same with 'net/http' and 'uri'.
Example :
require "uri"
require "net/http"
params = {'box1' => 'Nothing is less important than which fork you use. Etiquette is the science of living. It embraces everything. It is ethics. It is honor. -Emily Post',
'button1' => 'Submit'
}
x = Net::HTTP.post_form(URI.parse('http://www.interlacken.com/webdbdev/ch05/formpost.asp'), params)
puts x.body
Example source: Submitting POST data from the controller in rails to another website
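If the endpoint you POST to returns CSV in the response body, Ruby's standard-library CSV can parse x.body directly (assuming the file has a header row; adjust to the actual format):

require 'csv'

rows = CSV.parse(x.body, headers: true)
rows.each { |row| puts row.to_hash }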
I search for links via CSS from page = agent.get('http://www.print-index.ru/default.aspx?p=81&gr=198'), and after that I have a lot of links in the page variable, but I don't know how to use them, how to click on them via Mechanize. I found this method on Stack Overflow:
page = agent.get "http://google.com"
node = page.search ".//p[@class='posted']"
Mechanize::Page::Link.new(node, agent, page).click
but it works for only one link, so how can I use this method for many links?
If I should post additional information, please say so.
If your goal is simply to make it to the next page and then scrape some info off of it, then all you really care about are:
The page content (for scraping your data)
The URL of the next page you need to visit
Getting the page content can be done using Mechanize OR something else, like OpenURI (which is part of Ruby's standard library). As a side note, Mechanize uses Nokogiri behind the scenes; when you start to dig into elements on the parsed page, you will see they come back as Nokogiri-related objects.
Anyways, if this were my project I'd probably go the route of using OpenURI to get at the page's content and then Nokogiri to search it. I like the idea of using a Ruby standard library instead of requiring an additional dependency.
Here is an example using OpenURI:
require 'nokogiri'
require 'open-uri'
printing_page = Nokogiri::HTML(open("http://www.print-index.ru/default.aspx?p=81&gr=198"))
# ...
# Your code to scrape whatever you want from the Printing Page goes here
# ...
# Find the next page to visit. Example: You want to visit the "About the project" page next
about_project_link_in_navbar_menu = printing_page.css('a.graymenu')[4] # An overly simple finder; Nokogiri can do XPath searches too
about_project_link_in_navbar_menu_url = "http://www.print-index.ru#{about_project_link_in_navbar_menu.attributes["href"].value}" # Build the absolute URL
about_project_page = Nokogiri::HTML(open(about_project_link_in_navbar_menu_url)) # Get the About page's content
# ....
# Do something...
# ....
Here's an example using Mechanize to get the page content (they are very similar):
require 'mechanize'
agent = Mechanize.new
printing_page = agent.get("http://www.print-index.ru/default.aspx?p=81&gr=198")
# ...
# Your code to scrape whatever you want from the Printing Page goes here
# ...
# Find the next page to visit. Example: You want to visit the "About the project" page next
about_project_link_in_navbar_menu = printing_page.search('a.graymenu')[4] # An overly simple finder; Nokogiri can do XPath searches too
about_project_link_in_navbar_menu_url = "http://www.print-index.ru#{about_project_link_in_navbar_menu.attributes["href"].value}" # Build the absolute URL
about_project_page = agent.get(about_project_link_in_navbar_menu_url)
# ....
# Do something...
# ....
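That said, if you literally want to click every matching link with Mechanize, the constructor from your question generalizes to a loop (a sketch, reusing the a.graymenu selector from the example above):

page = agent.get('http://www.print-index.ru/default.aspx?p=81&gr=198')
page.search('a.graymenu').each do |node|
  link = Mechanize::Page::Link.new(node, agent, page)
  target_page = link.click # returns the fetched page
  # scrape target_page here
end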
PS: I used Google to translate the Russian to English; if the variable names are incorrect, I'm sorry! :X
Does anyone have an idea on how to implement this (http://railscasts.com/episodes/256-i18n-backends) with MongoDB/Mongoid? My question is primarily about the initializer.rb file.
The Mongo-I18n docs on GitHub suggest the following, using its MongoI18n::Store.new method:
collection = Mongo::Connection.new['my_app_related_db'].collection('i18n')
I18n.backend = I18n::Backend::KeyValue.new(MongoI18n::Store.new(collection))
But how to do this if you don't want to use their plugin? Is there something like a Mongo::Store method?
I just did this exact same thing, except that I had trouble installing Mongo-I18n, because it has a dependency on a very old version of MongoDB.
To get around this, I copied the code from here into lib/mongo_i18n.rb.
You were on the right track with your initializer, though. If you're using Mongoid, the best way forward is to do this:
require 'mongo_i18n'
collection = Mongoid.database.collection('i18n')
I18n.backend = I18n::Backend::KeyValue.new(MongoI18n::Store.new(collection))
This tells the I18n backend to use a new collection (called i18n) in the same database as the rest of your application.
Make sure you delete the mongo_i18n gem from your Gemfile and run bundle before starting your server again.
You can access your store directly using:
I18n.backend.store
But to make it a little cleaner, I added this method to my I18n library:
# mongo_i18n.rb
def self.store
  collection = Mongoid.database.collection('i18n')
  MongoI18n::Store.new(collection)
end
So that I can access the store directly with:
MongoI18n.store
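As a quick check that translations round-trip through the new backend (store_translations is part of the standard I18n API):

I18n.backend.store_translations(:en, :greeting => 'Hello from Mongo!')
I18n.t(:greeting) # => "Hello from Mongo!"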
I did exactly what theTRON said, except that instead of require 'mongo_i18n' I pasted the whole MongoI18n::Store class definition from the mongo_i18n gem directly into my Mongo initializer. It's not such a big deal, because the whole MongoI18n::Store class is only 41 lines long. Why take a dependency on a 41-line gem?
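A minimal sketch of what such an inlined store might look like (this is not the gem's actual code, just the [], []= and keys interface the KeyValue backend expects, written against the old Mongo driver API):

# config/initializers/mongo_i18n.rb
module MongoI18n
  class Store
    def initialize(collection)
      @collection = collection
    end

    # The I18n KeyValue backend reads translations with []
    def [](key)
      doc = @collection.find_one('_id' => key)
      doc && doc['value']
    end

    # ...writes them with []=
    def []=(key, value)
      @collection.save('_id' => key, 'value' => value)
    end

    # ...and lists locales via keys
    def keys
      @collection.find({}, :fields => ['_id']).map { |doc| doc['_id'] }
    end
  end
end

collection = Mongoid.database.collection('i18n')
I18n.backend = I18n::Backend::KeyValue.new(MongoI18n::Store.new(collection))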