I'm using the Mechanize gem to automate interaction with a website form.
The site I'm trying to interact with is http://www.tastekid.com/like/books
I'm trying to automatically submit a query string through the form and return the suggested books in an array.
Following the guide, I've pretty-printed the page layout to find the form name, but I'm only finding a form with no name (nil):
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.tastekid.com/like/books')
pp page
How do I enter a string, submit the form and return the results in the form of an array?
These answers feel a little cluttered to me, so let me try to make it simpler:
page = agent.get 'http://www.tastekid.com/like/books'
there's only one form, so:
form = page.form
form['q'] = 'twilight'
submit the form
page = form.submit
print the text from the a's
puts page.search('.books a').map &:text
Following the guide, you can get the form:
form = page.form
I didn't see a name on the form, and I actually got two forms back: one on the page and one hidden.
I called
form.fields.first.methods.sort #not the hidden form
and saw that I could call value on the field, so I set it as such:
form.fields.first.value = "Blood Meridian"
then I submitted and pretty-printed the result:
page = agent.submit(form)
pp page
This should work for you!
You could use the form_with method to locate the form you want. For example:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.tastekid.com/like/books')
the_form_you_want = page.form_with(:id => "searchFrm") # form_with
the_form_you_want.q = 'No Country for Old Men'
page = agent.submit(the_form_you_want)
pp page
It looks like the book titles all have the same class attribute. To extract the book titles, use the links_with method and pass in the class as a locator:
arr = []
page.links_with(:class => "rsrc").each do |link|
arr << link.text
end
But @aceofbassgreg is right. You'll need to read up on the Mechanize and Nokogiri documentation...
I am trying to detect when a new post has been added to a blog. I am using Mechanize for the scraping. Currently this is straightforward if you know the parent tags of a blog post, e.g. <article><header><h1>Blot Title here</h1></header></article>: you can just diff the titles you have now against the ones from the last time you checked. But I want to do this programmatically. Is there a way to programmatically determine which section or tags of a page hold the blog post titles without explicitly giving the hierarchy of tags to the script?
Suppose there is a blog named blog.example.com with these posts:
<article><header><h1>Blot Title here1</h1></header></article>
<article><header><h1>Blot Title here2</h1></header></article>
<article><header><h1>Blot Title here3</h1></header></article>
Using SelectorGadget, you can get an idea of which CSS selector matches each article. To scrape the articles, you can use the Nokogiri or Mechanize gem.
Suppose a Mechanize bot visits blog.example.com, collects all the articles and inserts them into your database.
require 'nokogiri'
require 'open-uri'

url = "http://www.eslemployment.com/country/esl-jobs-Vietnam.html"
# URI.open is the open-uri entry point (Ruby 2.5+); plain open(url) no longer accepts URLs on Ruby 3
doc = Nokogiri::HTML(URI.open(url))
data = []
# Follow the first five job links in the listing table
doc.css("#joblist td:nth-child(1) a").first(5).each do |titlecss|
  country = "8"
  jobtype = "1"
  urlnext = titlecss.attr('href')
  docnext = Nokogiri::HTML(URI.open(urlnext))
  # Remove the div elements nested inside #jobdescription before saving
  docnext.css('#jobdescription div').remove
  docnext.css('#detailjob , #job-summary').each do |detailscss|
    docnext.css('#pagemsg h1').each do |titlenextcss|
      data << JobPost.create(
        :title => titlenextcss.text,
        :jobslink => urlnext,
        :description => detailscss.inner_html,
        :country_id => country,
        :job_type_id => jobtype
      )
    end
  end
end
Here is an example using the Nokogiri gem. It collects jobs from www.eslemployment.com. Now, your question is how to detect that a new article has been added.
This code collects the jobs from a page and adds them to the database. I use "distinct" logic in the model so that only new jobs are added and no duplicates are stored. When a new job is added, you can send a notification saying which job it was.
This is not an efficient approach, but it will work. Otherwise you can use the blog's RSS feed, which is the proper way to detect new posts.
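The "only new items get added" idea can be sketched in plain Ruby, independent of how the titles were scraped. This is a minimal sketch, not the code from the answer: `new_titles` and the sample titles are hypothetical, and in a real app the previously seen titles would come from the database.

```ruby
require 'set'

# Returns the titles in `current` that are not in `seen`,
# keeping page order and dropping duplicates.
def new_titles(current, seen)
  seen_set = seen.to_set
  current.uniq.reject { |title| seen_set.include?(title) }
end

previous = ['Post title 1', 'Post title 2']                  # stored at the last check
latest   = ['Post title 3', 'Post title 2', 'Post title 1']  # scraped just now

fresh = new_titles(latest, previous)
puts fresh.inspect
```

Anything left in `fresh` is a candidate for a notification; an empty array means nothing changed since the last check.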
I have some code to extract offers on eBay, but there are several result pages and I get only the results of the first page. How can I loop through several result pages?
Here is my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.ebay.de/sch/i.html?_nkw=Suzuki+DR+BIG&_sacat=131090&_odkw=Suzuki+DR+BIG&_osacat=0&_from=R40"
doc = Nokogiri::HTML(open(url))
doc.css(".dtl").each do |dtl|
puts dtl.at_css(".vip").text
end
You have to aggregate the results from each page by pulling the link from the "next" button (which, inspecting the page, is at the css .botpg-next a) and loading it.
Something like this:
url = "http://www.ebay.de/sch/i.html?_nkw=Suzuki+DR+BIG&_sacat=131090&_odkw=Suzuki+DR+BIG&_osacat=0&_from=R40"
while url
  doc = Nokogiri::HTML(URI.open(url))
  doc.css(".dtl").each do |dtl|
    puts dtl.at_css(".vip").text
  end
  # at_css returns nil when the page has no "next" link, which ends the loop
  link = doc.at_css('.botpg-next a')
  url = link && link['href']
end
I'm just looping until no "next" button is found, but you could change that to limit the loop to a given number of results.
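Limiting the loop to a given number of results could look roughly like the sketch below. It is written against a stand-in for the network: `collect_results`, `fetch_page` and the fake pages are hypothetical names for illustration, with `fetch_page` assumed to return the items on a page plus the next URL (or nil on the last page).

```ruby
# Follows "next" links until there are none left or enough results were collected.
# `fetch_page` is a hypothetical callable returning [items, next_url_or_nil].
def collect_results(start_url, max_results:, fetch_page:)
  results = []
  url = start_url
  while url && results.size < max_results
    items, url = fetch_page.call(url)
    results.concat(items)
  end
  results.first(max_results)
end

# Stand-in for real HTTP requests: three fake pages linked together.
fake_site = {
  'page1' => [['offer A', 'offer B'], 'page2'],
  'page2' => [['offer C'], 'page3'],
  'page3' => [['offer D'], nil]
}
fetcher = ->(url) { fake_site.fetch(url) }

all_offers  = collect_results('page1', max_results: 10, fetch_page: fetcher)
first_three = collect_results('page1', max_results: 3, fetch_page: fetcher)
```

With the real site, the lambda would load the page with Nokogiri and return the scraped texts together with the href of the "next" link.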
I've been working to pull dynamic data from last.fm using youpy's "lastfm" gem. Getting the data works great; however, rails doesn't seem to like the dynamic portion. Right now, I have added the code to a helper module called "HomeHelper" (generated during the creation of the rails app) found in the helper folder:
module HomeHelper
  @@lastfm = Lastfm.new(key, secret)
  @@wesRecent = @@lastfm.user.get_recent_tracks(:user => 'weskey5644')

  def _album_art_helper
    trackHash = @@wesRecent[0]
    medAlbumArt = trackHash["image"][3]
    if medAlbumArt["content"] == nil
      html = "<img src=\"/images/noArt.png\" height=\"auto\" width=\"150\" />"
    else
      html = "<img src=#{medAlbumArt["content"]} height=\"auto\" width=\"150\" />"
    end
    html.html_safe
  end

  def _recent_tracks_helper
    lfartist1 = @@wesRecent[0]["artist"]["content"]
    lftrack1 = @@wesRecent[0]["name"]
    lfartist1 = @@wesRecent[1]["artist"]["content"]
    lftrack1 = @@wesRecent[1]["name"]
    htmltrack = "<div class=\"lastfm_recent_tracks\">
    <div class=\"lastfm_artist\"><p>#{lfartist1 = @@wesRecent[0]["artist"]["content"]}</p></div>
    <div class=\"lastfm_trackname\"><p>#{lftrack1 = @@wesRecent[0]["name"]}</p></div>
    <div class=\"lastfm_artist\"><p>#{lfartist2 = @@wesRecent[1]["artist"]["content"]}</p></div>
    <div class=\"lastfm_trackname\"><p>#{lftrack2 = @@wesRecent[1]["name"]}</p></div>
    </div>
    "
    htmltrack.html_safe
  end
end
I created a partial for each and added them to my Index page:
<div class="album_art"><%= render "album_art" %></div>
<div id="nowplayingcontain"><%= render "recent_tracks" %></div>
Great, this gets the data I need and displays on the page like I want; however, it seems that when the song changes, according to last.fm, it doesn't on my site unless I restart the server.
I've tested this using Phusion Passenger and also WEBrick, and it happens on both. I had thought this might be an issue with caching of this particular page, so I tried a couple of caching hacks to expire the page and reload. This didn't help.
I then came to the conclusion that sticking this code in a helper file might not be the best solution. I don't know how well helpers handle dynamic content such as this. If anyone has any insight on this, awesome!! Thanks everyone!
Your problem isn't that you're using a helper; the problem is that you're using class variables:
module HomeHelper
  @@lastfm = Lastfm.new(key, secret)
  @@wesRecent = @@lastfm.user.get_recent_tracks(:user => 'weskey5644')
that are initialized when the module is first read. In particular, @@wesRecent will be initialized once and then it will stay the same until you restart the server or happen to get a new server process. You should be able to call get_recent_tracks when you need it:
def _album_art_helper
  trackHash = @@lastfm.user.get_recent_tracks(:user => 'weskey5644').first
  #...
Note that this means that your two helpers won't necessarily be using the same track list.
You might want to add a bit of "only refresh the tracks at most once every minute" logic as well.
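One way to get that "at most once a minute" behaviour is a small time-based cache. The sketch below is an assumption-laden illustration rather than part of the lastfm gem: `TimedCache`, its `ttl:` and `clock:` parameters are hypothetical names, the clock is injectable only so the demo does not have to sleep, and the class is not thread-safe.

```ruby
# Caches the result of a block and re-runs the block only after `ttl` seconds.
class TimedCache
  def initialize(ttl:, clock: -> { Time.now })
    @ttl = ttl
    @clock = clock
    @value = nil
    @fetched_at = nil
  end

  # Returns the cached value; calls the block only when the cache is
  # empty or at least `ttl` seconds old.
  def fetch
    now = @clock.call
    if @fetched_at.nil? || now - @fetched_at >= @ttl
      @value = yield
      @fetched_at = now
    end
    @value
  end
end

# Demo with a fake clock; the counter stands in for the Last.fm request.
fake_now = 0
calls = 0
cache = TimedCache.new(ttl: 60, clock: -> { fake_now })

first = cache.fetch { calls += 1 }   # cache is empty, so the block runs
fake_now = 30
second = cache.fetch { calls += 1 }  # 30 seconds later, cached value is reused
fake_now = 61
third = cache.fetch { calls += 1 }   # past the TTL, so the block runs again
```

In the helper, both methods could share one cache instance and wrap the get_recent_tracks call in `fetch`, which would also keep the two helpers on the same track list between refreshes.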
I have a basic search page that replaces html in a div with the search results.
Clicking on one of the search results brings up a detail page. And I have a link on that detail page that says "Go back to search results".
My problem is that it just generates a link to the root_url. Is there a way I can generate a valid back link to those AJAX search results?
Using the example of an employee index, you can store the last search in your session, like so:
def index
  if params[:search]
    session[:last_search] = params[:search]
    @employees = Employee.search(params[:search])
  elsif session[:last_search]
    @employees = Employee.search(session[:last_search])
  else
    @employees = Employee.all
  end
end
Substitute whatever searching method you're using, of course. And there are a lot of ways this will look different depending on what else you're doing. But this should show you how to use the session hash to your advantage.
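The fallback order in that action can be tried outside Rails with plain hashes standing in for params and session. This is only a sketch of the decision logic; `resolve_search` is a hypothetical helper, not a Rails method.

```ruby
# Picks the search term to use: an explicit term in params wins and is
# remembered; otherwise the last remembered term (or nil) is returned.
def resolve_search(params, session)
  if params[:search]
    session[:last_search] = params[:search]
  else
    session[:last_search]
  end
end

session = {}
before_any_search = resolve_search({}, session)                   # nothing stored yet
explicit_search   = resolve_search({ search: 'smith' }, session)  # stores the term
via_back_link     = resolve_search({}, session)                   # falls back to the stored term
```

The "Go back to search results" link can then point at the index action without parameters, since the stored term is picked up on the next request.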
I'm trying to use mechanize to perform a simple search on my college's class schedule db. The following code returns nil, however it works logging into facebook and searching google (with diff url/params). What am I doing wrong?
I'm following the latest (great) railscast here. Mechanize documentation has been useful but I'm still puzzled. Thanks in advance for your suggestions!
ruby script/console
require 'mechanize'
agent = WWW::Mechanize.new
agent.get("https://www.owens.edu/cgi-bin/class.pl/")
agent.page.forms
form = agent.page.forms.last
form.occ_subject = "chm"
form.submit.search
=> []
Remove search from form.submit.search, i.e. just call form.submit. I'm guessing you appended search to submit thinking that it has something to do with the value of the submit button, i.e. search.
What your code is doing IS successfully submitting the form. However, you are then calling the search method of the resulting page object with a nil argument. The search method expects a selector, e.g. 'body div#nav_bar ul.links li', as an argument so that it can return an array of the elements that match that selector. Of course no elements will match a nil selector, hence the empty array.
Edit per your response:
Your code:
ruby script/console
require 'mechanize'
agent = WWW::Mechanize.new
agent.get("https://www.owens.edu/cgi-bin/class.pl/")
agent.page.forms
form = agent.page.forms.last
form.occ_subject = "chm"
form.submit.search
=> []
What I tried and got to work:
ruby script/console
require 'mechanize'
agent = WWW::Mechanize.new
agent.get("https://www.owens.edu/cgi-bin/class.pl")
agent.page.forms
form = agent.page.forms.last
form.occ_subject = "chm"
form.submit # <- No search method.
=> Insanely long array of HTML elements
The same code will not work with Google either:
require 'mechanize'
require 'nokogiri'
agent = WWW::Mechanize.new
agent.get("http://www.google.com")
form = agent.page.forms.last
form.q = "stackoverflow"
a = form.submit.search
b = form.submit
puts a
=> [] # <--- EMPTY!
puts b
#<WWW::Mechanize::Page
{url
#<URI::HTTP:0x1020ea878 URL:http://www.google.co.uk/search?hl=en&source=hp&ie=ISO-8859-1&q=stackoverflow&meta=>}
{meta}
{title "stackoverflow - Google Search"}
{iframes}
{frames}
{links
#<WWW::Mechanize::Page::Link
"Images"
"http://images.google.co.uk/images?hl=en&source=hp&q=stackoverflow&um=1&ie=UTF-8&sa=N&tab=wi">
#<WWW::Mechanize::Page::Link
"Videos"
…
The search method of a page object behaves like the search method of Nokogiri, in that it accepts a sequence of CSS selectors and/or XPath queries and returns an enumerable object of matching elements. e.g.
page.search('h3.r a.l', '//h3/a[@class="l"]')
The page returns a null result when it is queried through WWW::Mechanize.
I'm not sure if WWW::Mechanize can handle POSTing to this secure page.
"can't convert nil into String" means it can't show you in a text form what nothing is. It can't convert something from nothing.
It also might be a problem with the form and the script delay.
Try using curl for debugging, POSTing with something like curl -d "occ_subject=chm" https://www.owens.edu/cgi-bin/class.pl. When I tried that, it returned a page.
I think it's a problem with the secure page and the cgi script combined.