I'm trying to use Mechanize to perform a simple search on my college's class schedule db. The following code returns nil; however, the same approach works for logging into Facebook and searching Google (with a different URL/params). What am I doing wrong?
I'm following the latest (great) railscast here. Mechanize documentation has been useful but I'm still puzzled. Thanks in advance for your suggestions!
ruby script/console
require 'mechanize'
agent = WWW::Mechanize.new
agent.get("https://www.owens.edu/cgi-bin/class.pl/")
agent.page.forms
form = agent.page.forms.last
form.occ_subject = "chm"
form.submit.search
=> []
Remove search from form.submit.search, i.e. call form.submit on its own. I'm guessing you appended search to submit thinking that it has something to do with the value of the submit button (i.e. "search").
What your code is doing IS successfully submitting the form. However, you are then calling the search method of the resulting page object with a nil argument. The search method expects a selector, e.g. 'body div#nav_bar ul.links li', as an argument, and returns an array of the elements that match that selector. Of course no elements will match a nil selector, hence the empty array.
Edit per your response:
Your code:
ruby script/console
require 'mechanize'
agent = WWW::Mechanize.new
agent.get("https://www.owens.edu/cgi-bin/class.pl/")
agent.page.forms
form = agent.page.forms.last
form.occ_subject = "chm"
form.submit.search
=> []
What I tried and got to work:
ruby script/console
require 'mechanize'
agent = WWW::Mechanize.new
agent.get("https://www.owens.edu/cgi-bin/class.pl")
agent.page.forms
form = agent.page.forms.last
form.occ_subject = "chm"
form.submit # <- No search method.
=> Insanely long array of HTML elements
The same code will not work with Google either:
require 'mechanize'
require 'nokogiri'
agent = WWW::Mechanize.new
agent.get("http://www.google.com")
form = agent.page.forms.last
form.q = "stackoverflow"
a = form.submit.search
b = form.submit
puts a
=> [] # <--- EMPTY!
puts b
#<WWW::Mechanize::Page
{url
#<URI::HTTP:0x1020ea878 URL:http://www.google.co.uk/search?hl=en&source=hp&ie=ISO-8859-1&q=stackoverflow&meta=>}
{meta}
{title "stackoverflow - Google Search"}
{iframes}
{frames}
{links
#<WWW::Mechanize::Page::Link
"Images"
"http://images.google.co.uk/images?hl=en&source=hp&q=stackoverflow&um=1&ie=UTF-8&sa=N&tab=wi">
#<WWW::Mechanize::Page::Link
"Videos"
…
The search method of a page object behaves like the search method of Nokogiri, in that it accepts a sequence of CSS selectors and/or XPath queries and returns an enumerable object of matching elements. e.g.
page.search('h3.r a.l', '//h3/a[@class="l"]')
The page returns a null result when it is queried through WWW::Mechanize.
I'm not sure if WWW::Mechanize can handle POSTing to this secure page.
"can't convert nil into String" means that Ruby expected a String somewhere but was handed nil instead. There is nothing there to convert.
It also might be a problem with the form and the script delay.
Try using curl for debugging, POSTing with something like curl -d "occ_subject=chm" https://www.owens.edu/cgi-bin/class.pl. When I tried that, it returned a page.
I think it's a problem with the secure page and the cgi script combined.
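For comparison, roughly the same POST as the curl command can be built with Ruby's standard Net::HTTP, which takes Mechanize out of the picture while debugging. This is only a sketch: post_form is a helper name made up here, and it is defined but not called, so nothing goes over the network until you call it yourself.

```ruby
require 'net/http'
require 'uri'

uri = URI('https://www.owens.edu/cgi-bin/class.pl')

# Build the same form-encoded body that `curl -d "occ_subject=chm"` sends.
request = Net::HTTP::Post.new(uri)
request.set_form_data('occ_subject' => 'chm')

# Show the encoded body that would be sent.
puts request.body

# Hypothetical helper: call post_form(uri, request) to actually send the request.
def post_form(uri, request)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end
```

If post_form returns a normal page while Mechanize does not, that points at the form handling rather than at HTTPS itself.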
Related
I am trying to detect when a new post has been added to a blog. I am using Mechanize for the scraping. Currently this is straightforward if you know the parent tags of a blog post, e.g. <article><header><h1>Blot Title here</h1></header></article>: you can just diff the titles you have now against the ones from the last time you checked. But I want to do this programmatically. Is there a way to programmatically determine which section or tags of a page hold the titles of the blog posts, without explicitly giving the hierarchy of tags to the script?
Suppose there is a blog named blog.example.com with the following posts:
<article><header><h1>Blot Title here1</h1></header></article>
<article><header><h1>Blot Title here2</h1></header></article>
<article><header><h1>Blot Title here3</h1></header></article>
Using SelectorGadget you can get an idea of which CSS selector matches each article. To scrape the articles you can use the nokogiri or mechanize gem.
Suppose a Mechanize bot visits blog.example.com, collects all the articles, and inserts them into your database.
require 'nokogiri'
require 'open-uri'
url = "http://www.eslemployment.com/country/esl-jobs-Vietnam.html"
doc = Nokogiri::HTML(open(url))
data = []
# Follow the first five job links on the listing page
doc.css("#joblist td:nth-child(1) a").first(5).each do |titlecss|
  country = "8"
  jobtype = "1"
  urlnext = titlecss.attr('href')
  docnext = Nokogiri::HTML(open(urlnext))
  # Remove the nested divs inside the job description before reading it
  docnext.css('#jobdescription div').remove
  docnext.css('#detailjob , #job-summary').each do |detailscss|
    docnext.css('#pagemsg h1').each do |titlenextcss|
      # JobPost is the Rails model that stores each scraped job
      data << JobPost.create(
        :title => titlenextcss.text,
        :jobslink => urlnext,
        :description => detailscss.inner_html,
        :country_id => country,
        :job_type_id => jobtype
      )
    end
  end
end
The code above is an example using the nokogiri gem. It collects the jobs from www.eslemployment.com. Now, your question is how to detect that a new article has been added.
This code collects all the jobs from a page and adds them to the database. I use a "distinct" check in the model, so only new jobs are added to the database and no duplicates are inserted. When new jobs are added, you can send a notification saying which jobs were added.
This is not an efficient way, but it will work. Otherwise you can use the RSS feed of that blog, which is the proper way to detect new posts.
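The "only keep what is new" step can also be sketched without a database, as a plain diff between the titles seen on the last run and the titles scraped now. new_titles is a hypothetical helper and both title lists are made-up sample data:

```ruby
require 'set'

# Return the titles in `current` that were not present in `previous`,
# keeping the order in which they appear on the page.
def new_titles(previous, current)
  seen = previous.to_set
  current.reject { |title| seen.include?(title) }
end

# Titles stored from the last check (sample data).
previous = ['Blot Title here1', 'Blot Title here2']
# Titles scraped on this check (sample data).
current = ['Blot Title here3', 'Blot Title here2', 'Blot Title here1']

fresh = new_titles(previous, current)
puts fresh
```

In a real run, previous would come from wherever you persist the last result (a file or a table), and fresh is what you would notify about.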
I'm using the Mechanize gem to automate interaction with a website form.
The site I'm trying to interact with is http://www.tastekid.com/like/books
I'm trying to automatically submit a string to query in the form and return the suggested books in an array.
Following the guide, I've pretty printed the page layout to find the form name, but I am just finding a form with no name (nil):
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.tastekid.com/like/books')
pp page
How do I enter a string, submit the form and return the results in the form of an array?
These answers feel a little cluttered to me, so let me try to make it simpler:
page = agent.get 'http://www.tastekid.com/like/books'
there's only one form, so:
form = page.form
form['q'] = 'twilight'
submit the form
page = form.submit
print the text from the a's
puts page.search('.books a').map &:text
Following the guide, you can get the form:
form = page.form
I didn't see a name on the form, and I actually got two forms back: one on the page and one hidden.
I called
form.fields.first.methods.sort #not the hidden form
and saw that I could call value on the field, so I set it as such:
form.fields.first.value = "Blood Meridian"
then I submitted and pretty printed:
page = agent.submit(form)
This should work for you!
You could use the form_with method to locate the form you want. For example:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.tastekid.com/like/books')
the_form_you_want = page.form_with(:id => "searchFrm") # form_with
the_form_you_want.q = 'No Country for Old Men'
page = agent.submit(the_form_you_want)
pp page
It looks like the book titles all have the same class attribute. To extract the book titles, use the links_with method and pass in the class as a locator:
arr = []
page.links_with(:class => "rsrc").each do |link|
arr << link.text
end
But @aceofbassgreg is right. You'll need to read up on the mechanize and nokogiri documentation...
I have some code to extract offers on eBay, but there are several result pages and I get only the results of the first page. How can I loop through several result pages?
Here is my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.ebay.de/sch/i.html?_nkw=Suzuki+DR+BIG&_sacat=131090&_odkw=Suzuki+DR+BIG&_osacat=0&_from=R40"
doc = Nokogiri::HTML(open(url))
doc.css(".dtl").each do |dtl|
puts dtl.at_css(".vip").text
end
You have to aggregate the results from each page by pulling the link from the "next" button (which, inspecting the page, is at the css .botpg-next a) and loading it.
Something like this:
url = "http://www.ebay.de/sch/i.html?_nkw=Suzuki+DR+BIG&_sacat=131090&_odkw=Suzuki+DR+BIG&_osacat=0&_from=R40"
while url
  doc = Nokogiri::HTML(open(url))
  doc.css(".dtl").each do |dtl|
    puts dtl.at_css(".vip").text
  end
  # at_css returns nil when nothing matches, whereas css returns an empty (truthy) NodeSet
  link = doc.at_css('.botpg-next a')
  url = link && link['href'] #=> url is nil if no link is found on the page
end
I'm just looping until no "next" button is found, but you could change that to limit the loop to a given number of results.
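A minimal sketch of that capped variant. To keep it runnable offline, the fetching is abstracted into a callable that returns the items on a page plus the next URL; collect_items, FAKE_SITE and the page names are all made up for illustration, and in real use the lambda would do the Nokogiri work shown above.

```ruby
# fetch_page is any callable that takes a URL and returns
# [items_on_that_page, next_url_or_nil].
def collect_items(start_url, max_pages, fetch_page)
  items = []
  url = start_url
  pages = 0
  # Stop when there is no "next" link or the page limit is reached.
  while url && pages < max_pages
    page_items, url = fetch_page.call(url)
    items.concat(page_items)
    pages += 1
  end
  items
end

# Canned three-page "site" standing in for the real result pages.
FAKE_SITE = {
  'page1' => [['Offer A', 'Offer B'], 'page2'],
  'page2' => [['Offer C'], 'page3'],
  'page3' => [['Offer D'], nil]
}
fetcher = ->(url) { FAKE_SITE.fetch(url) }

first_two  = collect_items('page1', 2, fetcher)
everything = collect_items('page1', 10, fetcher)
puts first_two
```

Capping by page count rather than by result count keeps the loop condition simple; trimming to an exact number of results can be done afterwards with first(n).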
I want to scrape Groupon deals using Nokogiri. I want to scrape all these deals at the following link:
http://www.groupon.com/getaways?d=travel_countmein
On top of that, I want to access each individual link and scrape the title and price. Conceptually, is there a way to code a single rake task to do this?
I understand that there needs to be a loop of some sort, but I don't know how to parse the url for each deal from the main getaway page.
I've already written a scraper for the title and price:
task :fetch_travel => :environment do
  require 'nokogiri'
  require 'open-uri'
  url = "http://www.groupon.com/deals/ga-flamingo-conferences-resort-spa?c=all&p=0"
  doc = Nokogiri::HTML(open(url))
  title = doc.at_css("#content//h2/a").text
  price = doc.at_css("#amount").text[/[0-9\.]+/]
  link = doc.at_css("#content//h2/a")[:href]
  desc = doc.at_css(".descriptor").text
  Traveldeal.create(:title => title, :price => price, :url => link, :description => desc)
end
Figured out that this requires a nested loop, where the inner loop is the code above and the outer loop parses the main page for the URL of each deal, which is then used in the inner loop.
I have a model called Book, which has_many :photos (file attachments handled by paperclip).
I'm currently building a client which will communicate with my Rails app through JSON, using Paul Dix's Typhoeus gem, which uses libcurl.
POSTing a new Book object was easy enough. To create a new book record with the title "Hello There" I could do something as simple as this:
require 'rubygems'
require 'json'
require 'typhoeus'
class Remote
include Typhoeus
end
p Remote.post("http://localhost:3000/books.json",
{ :params =>
{ :book => { :title => "Hello There" }}})
My problems begin when I attempt to add the photos to this query. Simply POSTing the file attachments through the HTML form creates a query like this:
Parameters: {"commit"=>"Submit", "action"=>"create", "controller"=>"books", "book"=>{"title"=>"Hello There", "photo_attributes"=>[{"image"=>#<File:/var/folders/1V/1V8Kw+LEHUCKonqJ-dp3oE+++TI/-Tmp-/RackMultipart20090917-3026-i6d6b9-0>}]}}
And so my assumption is I'm looking to recreate the same query in the Remote.post call.
I'm thinking that I'm letting the syntax of the array of hashes within a hash get the best of me. I've been attempting to do variations of what I was expecting would work, which would be something like:
p Remote.post("http://localhost:3000/books.json",
{ :params =>
{ :book => { :title => "Hello There",
:photo_attributes => [{ :image => "/path/to/image/here" }] }}})
But this seems to concatenate into a string what I'm trying to make into a hash, and returns (no matter what I do in the :image => "" hash):
NoMethodError (undefined method `stringify_keys!' for "image/path/to/image/here":String):
But I also don't want to waste too much time figuring out what is wrong with my syntax here if this isn't going to work anyway, so I figured I'd come here.
My question is:
Am I on the right track? If I clear up this syntax to post an array of hashes instead of an oddly concatenated string, should that be enough to pass the images into the Book object?
Or am I approaching this wrong?
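On the syntax part of the question: Rails (via Rack) rebuilds nested params from bracketed field names, so an array of hashes normally travels as keys like book[photo_attributes][][image]. The sketch below only illustrates that naming convention. flatten_params is a helper made up for this example, not part of Typhoeus, and sending a path string this way would not by itself upload the file's contents.

```ruby
# Flatten a nested params hash into [name, value] pairs using the
# bracketed naming convention that Rack/Rails parse back into nested params.
def flatten_params(value, prefix = nil)
  case value
  when Hash
    value.flat_map do |key, nested|
      flatten_params(nested, prefix ? "#{prefix}[#{key}]" : key.to_s)
    end
  when Array
    # Array elements get an empty bracket pair appended to the name.
    value.flat_map { |nested| flatten_params(nested, "#{prefix}[]") }
  else
    [[prefix, value.to_s]]
  end
end

params = { :book => { :title => "Hello There",
                      :photo_attributes => [{ :image => "/path/to/image/here" }] } }

pairs = flatten_params(params)
pairs.each { |name, value| puts "#{name}=#{value}" }
```

If the client library concatenates the array into a single string instead of producing names like these, the server ends up with a String where it expects a Hash, which is consistent with the stringify_keys! error above.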
Actually, you can't post files over XHR; there is a security precaution in JavaScript that prevents it from handling any files at all. The trick to get around this is to post the file to a hidden iframe: the iframe does a regular post to the server, which avoids the full page refresh. The technique is detailed in several places; possibly try this one (they are using PHP, but the principle remains the same, and there is a lengthy discussion which is helpful):
Posting files to a hidden iframe