Nokogiri parsing for metawords - ruby-on-rails

I know this question has been asked before, but I am not able to get the parsed result. I am trying to parse meta keywords using Nokogiri. Can anyone point out my mistake?
keyword = []
meta_data = doc.xpath('//meta[@name="Keywords"]/@content') # parsing for keywords
meta_data.each do |meta|
keyword << meta.value
end
key_str=keyword.join(",")
I tried running this in irb as well but keyword returns a nil.
This is how I used it in irb
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML("www.google.com")
I have already tried alternatives from other Stack Overflow posts, such as "Nokogiri html parsing question", but to no avail: they still return nil. I guess I am doing something wrong somewhere.

www.google.com does not have any meta keywords in the source. View Source on the page to see for yourself. So even if everything else went perfectly, you'd still get no results there.
The result of doc = Nokogiri::HTML("www.google.com") is
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>www.google.com</p></body></html>
If you want to fetch the contents of a URL, you want to use something like:
require 'open-uri'
doc = Nokogiri::HTML( URI.open('http://www.google.com') ) # URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
If you get a valid HTML page, and use the proper casing on keywords to match the source, it works fine. Here's an example from my IRB session, fetching a page from one of the apps on my site that happens to use name="keywords" instead of name="Keywords":
irb(main):001:0> require 'open-uri'
#=> true
irb(main):002:0> require 'nokogiri'
#=> true
irb(main):003:0> url = "http://pentagonalrobin.phrogz.net/choose"
#=> "http://pentagonalrobin.phrogz.net/choose"
irb(main):004:0> doc = Nokogiri::HTML( open(url) ); nil # don't show doc here
#=> nil
irb(main):005:0> doc.xpath('//meta[@name="keywords"]/@content').map(&:value)
#=> ["team schedule free round-robin league"]

Related

Find within the first 10?

I'm using Nokogiri to screen-scrape contents of a website.
I set fetch_number to specify the number of <divs> that I want to retrieve. For example, I may want the first(10) tweets from the target page.
The code looks like this:
doc.css(".tweet").first(fetch_number).each do |item|
title = item.css("a")[0]['title']
end
However, when fewer than 10 matching div tags are returned, it reports
NoMethodError: undefined method 'css' for nil:NilClass
This is because, when no matching HTML is found, it will return nil.
How can I make it return all the available data, up to 10 items? I don't need the nils.
UPDATE:
task :test_fetch => :environment do
require 'nokogiri'
require 'open-uri'
url = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(open(url) )
puts doc.css(".main-wrap .item").count
doc.css(".main-wrap .item").first(30).each do |item_info|
if item_info
href = item_info.at(".detail a")['href']
puts href
else
puts 'this is empty'
end
end
end
Returned results (near the end):
24
http://item.taobao.com/item.htm?id=41249522884
http://item.taobao.com/item.htm?id=40369253621
http://item.taobao.com/item.htm?id=40384876796
http://item.taobao.com/item.htm?id=40352486259
http://item.taobao.com/item.htm?id=40384968205
.....
http://item.taobao.com/item.htm?id=38843789106
http://item.taobao.com/item.htm?id=38843517455
http://item.taobao.com/item.htm?id=38854788276
http://item.taobao.com/item.htm?id=38825442050
http://item.taobao.com/item.htm?id=38630599372
http://item.taobao.com/item.htm?id=38346270714
http://item.taobao.com/item.htm?id=38357729988
http://item.taobao.com/item.htm?id=38345374874
this is empty
this is empty
this is empty
this is empty
this is empty
this is empty
count reports only 24 elements, but first(30) returns a 30-element array.
And it may not actually be an array, but a Nokogiri::XML::NodeSet? I'm not sure.
title = item.css("a")[0]['title']
is a bad practice.
Instead, consider using at or at_css instead of search or css:
title = item.at('a')['title']
Next, if the <a> tag returned doesn't have a title attribute, item.at('a')['title'] returns nil, and any later code that expects a string will fail. Instead, tighten your CSS selector so it only matches tags like <a title="foo">:
require 'nokogiri'
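doc = Nokogiri::HTML('<body><a href="foo">foo</a><a href="bar" title="bar">bar</a></body>')
doc.at('a').to_html # => "<a href=\"foo\">foo</a>"
doc.at('a[title]').to_html # => "<a href=\"bar\" title=\"bar\">bar</a>"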
Notice how the first call, which is not constrained to tags with a title attribute, returns the first <a> tag. Using a[title] only returns tags that have a title attribute.
That means your loop over the values will never return nil, and you won't have a problem needing to compact them out of the returned array.
As a general programming tip, if you're getting nils like that, look at the code generating the array, because odds are good it's not doing it right. You should ALWAYS know what sort of results your code will generate. Using compact to clean up the array is a knee-jerk reaction to not having written the code correctly most of the time.
Here's your updated code:
require 'nokogiri'
require 'open-uri'
url = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(open(url) )
puts doc.css(".main-wrap .item").count
doc.css(".main-wrap .item").first(30).each do |item_info|
if item_info
href = item_info.at(".detail a")['href']
puts href
else
puts 'this is empty'
end
end
And here's what's wrong:
doc.css(".main-wrap .item").first(30)
Here's a simple example demonstrating why that doesn't work:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
In Nokogiri, search, css and xpath are equivalent, except that the first is generic and can take either CSS or XPath, while the last two are specific to that language.
doc.search('p') # => [#<Nokogiri::XML::Element:0x3fcf360ef750 name="p" children=[#<Nokogiri::XML::Text:0x3fcf360ef4f8 "foo">]>]
doc.search('p').size # => 1
doc.search('p').map(&:to_html) # => ["<p>foo</p>"]
That shows that the NodeSet returned by doing a simple search returns only one node, and what the node looks like.
doc.search('p').first(2) # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>, nil]
doc.search('p').first(2).size # => 2
Calling first(n) on the NodeSet returns n elements. If fewer nodes were found, Nokogiri pads the result with nil values, as the output above shows.
This runs counter to what we'd expect first(n) to do, since Enumerable#first returns up to n elements and doesn't pad with nils. It isn't a bug as such, but it is surprising, because Enumerable#first sets the expectation for methods with that name. This is NodeSet#first, not Enumerable#first, so it behaves this way until the Nokogiri authors change it. (You can see why it happens if you look at the source for that method.)
Instead, slicing the NodeSet does show the expected behavior:
doc.search('p')[0..1] # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>]
doc.search('p')[0..1].size # => 1
doc.search('p')[0, 2] # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>]
doc.search('p')[0, 2].size # => 1
So, don't use NodeSet#first(n), use the slice form NodeSet#[].
Applying that, I'd write the code something like:
require 'nokogiri'
require 'open-uri'
URL = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(URI.open(URL)) # URI.open comes from open-uri
hrefs = doc.css(".main-wrap .item .detail a[href]")[0..29].map { |anchors|
anchors['href']
}
puts hrefs.size
puts hrefs
# >> 24
# >> http://item.taobao.com/item.htm?id=41249522884
# >> http://item.taobao.com/item.htm?id=40369253621
# >> http://item.taobao.com/item.htm?id=40384876796
# >> http://item.taobao.com/item.htm?id=40352486259
# >> http://item.taobao.com/item.htm?id=40384968205
# >> http://item.taobao.com/item.htm?id=40384816312
# >> http://item.taobao.com/item.htm?id=40384600507
# >> http://item.taobao.com/item.htm?id=39973451949
# >> http://item.taobao.com/item.htm?id=39861209551
# >> http://item.taobao.com/item.htm?id=39545678869
# >> http://item.taobao.com/item.htm?id=39535371171
# >> http://item.taobao.com/item.htm?id=39509186150
# >> http://item.taobao.com/item.htm?id=38973412667
# >> http://item.taobao.com/item.htm?id=38910499863
# >> http://item.taobao.com/item.htm?id=38942960787
# >> http://item.taobao.com/item.htm?id=38910403350
# >> http://item.taobao.com/item.htm?id=38843789106
# >> http://item.taobao.com/item.htm?id=38843517455
# >> http://item.taobao.com/item.htm?id=38854788276
# >> http://item.taobao.com/item.htm?id=38825442050
# >> http://item.taobao.com/item.htm?id=38630599372
# >> http://item.taobao.com/item.htm?id=38346270714
# >> http://item.taobao.com/item.htm?id=38357729988
# >> http://item.taobao.com/item.htm?id=38345374874
Try this
doc.css(".tweet").first(fetch_number).each do |item|
title = item.css("a")[0]['title'] rescue nil
end
Let me know whether it works. It will not raise the error.
Try compact.
[1, nil, 2, nil, 3].compact # => [1, 2, 3]
http://www.ruby-doc.org/core-2.1.3/Array.html#method-i-compact
(i.e. first(fetch_number).compact.each do |item|)

Xpath.each not working in rails

My code:
require 'rexml/document'
require 'xpath'
doc = REXML::Document.new(xml)
XPath.each(doc, "*/categoryName") { |element| puts element.text }
I am trying to take xml, which is a string of XML, and retrieve some text from it, e.g. from <categoryName>I want this text</categoryName>.
I thought the code above would work, but it is giving me the following error:
undefined method `each' for XPath:Module
I'm not sure what 'xpath' library you're loading, but you don't want or need it. Confusing the matter is that REXML's documentation assumes that you have 'polluted' your global object via include REXML. Since you are not doing that, you need to provide the full path to the module:
require 'rexml/document'
doc = REXML::Document.new(xml)
REXML::XPath.each( doc, "*/category"){ |el| puts el.text }
I think you missed require 'rexml/xpath'. Add it and call REXML::XPath.each instead; that will work.
require 'rexml/document'
require 'rexml/xpath'
doc = REXML::Document.new(xml)
REXML::XPath.each( doc, "*/category") { |element| puts element.text }
One example:
require 'rexml/document'
require 'rexml/xpath'
doc = REXML::Document.new("<p>some text <b>this is bold!</b> more text</p>")
REXML::XPath.each(doc, "*//b") { |element| puts element.text }
# >> this is bold!
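For the element name used in the question, a sketch along the same lines might look like this. The XML string is invented for illustration, and REXML::XPath.match is used instead of each so the texts end up in an array:

```ruby
require 'rexml/document'

# Hypothetical XML, standing in for the string the question calls `xml`.
xml = <<~XML
  <categories>
    <category><categoryName>Books</categoryName></category>
    <category><categoryName>Music</categoryName></category>
  </categories>
XML

doc = REXML::Document.new(xml)

# Always use the fully qualified REXML::XPath unless you `include REXML`.
# "//categoryName" finds the element at any depth in the document.
names = REXML::XPath.match(doc, '//categoryName').map(&:text)
puts names
```

Using // rather than */ avoids having to know exactly how deeply the element is nested.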

Rails: Cannot Parse XML Response using Nokogiri

I'm basically trying to get the Lyric tag from the response I make to the ChartLyrics API. Here is the code I've written:
require 'nokogiri'
require 'open-uri'
request = Net::HTTP.get(URI.parse('http://api.chartlyrics.com/apiv1.asmx/GetLyric?lyricId=1710&lyricCheckSum=a4a56a99ee00cd8e67872a7764d6f9c6'))
puts request
response = Nokogiri::XML(request)
puts response.xpath("//Lyric")[0].to_s
I've read the documentation but did not find an answer. What am I doing wrong here?
Try the below code
require 'open-uri'
require 'nokogiri'
xml_doc = Nokogiri::XML(URI.open('http://api.chartlyrics.com/apiv1.asmx/GetLyric?lyricId=1710&lyricCheckSum=a4a56a99ee00cd8e67872a7764d6f9c6')) # URI.open comes from open-uri
# I prefer CSS selectors to XPath
lyrics = xml_doc.css('Lyric')
if lyrics.empty?
puts "Could not find any lyric in the XML document"
else
puts lyrics[0].to_s
end
Do response.remove_namespaces! before response.xpath

How to find the href element value in "<a>" tag with ruby

My goal is to find the first result in the Google search results and collect the site link, so I built this script:
require 'hpricot'
require 'open-uri'
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
search_results = search_results.body
doc = Hpricot(search_results)
site = doc.search("a")[16,1]
url = site.to_s
puts url
I get a string like this:
url = <a href="http://en.wikipedia.org/wiki/Gallon"><em>Gallon</em> - Wikipedia, the free encyclopedia</a>
But I need only the link (http://en.wikipedia.org/wiki/Gallon) not all the html code...
How can I do it? I am using the gems:
require 'hpricot'
require 'open-uri'
require 'mechanize'
You can get the value of attributes like this
(doc/"a")[16].attributes['href']
but I have to say that the magic number 16 seems brittle.
You are also not supposed to scrape the search results, you should consider using the Custom Search API.
Since Mechanize includes Nokogiri, you can skip Hpricot altogether. It will slow your code down unnecessarily. You are effectively doing the same thing twice.
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
puts search_results.links[16].href
Instead of converting to a string with url = site.to_s, do url = site[0].attributes['href']
Try using:
site = doc.search("a[@href]")[16,1]
Watir is a reasonable choice for checking the layout of a web page.
require 'rubygems'
require 'watir'
#Launching browser windows and navigating to google
browser = Watir::Browser.new
browser.goto("http://www.google.co.il/")
#Logging to console if a link with href = http://en.wikipedia.org/wiki/Gallon present
puts browser.link(:href, "http://en.wikipedia.org/wiki/Gallon").exists?
Since the input is always going to follow the same format, you could just do:
url.split("href=\"").last.split("\"").first

Mechanize - How to follow or "click" Meta refreshes in rails

I'm having a bit of trouble with Mechanize.
When I submit a form with Mechanize, I land on a page that contains only a meta refresh and no links.
My question is: how do I follow the meta refresh?
I have tried enabling meta refresh following, but then I get a socket error.
Sample code
require 'mechanize'
agent = WWW::Mechanize.new
agent.get("http://euroads.dk")
form = agent.page.forms.first
form.username = "username"
form.password = "password"
form.submit
page = agent.get("http://www.euroads.dk/system/index.php?showpage=login")
agent.page.body
The response:
<html>
<head>
<META HTTP-EQUIV=\"Refresh\" CONTENT=\"0;URL=index.php?showpage=m_frontpage\">
</head>
</html>
Then I try:
redirect_url = page.parser.at('META[HTTP-EQUIV=\"Refresh\"]')[
"0;URL=index.php?showpage=m_frontpage\"][/url=(.+)/, 1]
But I get:
NoMethodError: Undefined method '[]' for nil:NilClass
Internally, Mechanize uses Nokogiri to handle parsing of the HTML into a DOM. You can get at the Nokogiri document so you can use either XPath or CSS accessors to dig around in a returned page.
This is how to get the redirect URL with Nokogiri only:
require 'nokogiri'
html = <<EOT
<html>
<head>
<meta http-equiv="refresh" content="2;url=http://www.example.com/">
</meta>
</head>
<body>
foo
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
redirect_url = doc.at('meta[http-equiv="refresh"]')['content'][/url=(.+)/, 1]
redirect_url # => "http://www.example.com/"
doc.at('meta[http-equiv="refresh"]')['content'][/url=(.+)/, 1] breaks down to: Find the first occurrence (at) of the CSS accessor for the <meta> tag with an http-equiv attribute of refresh. Take the content attribute of that tag and return the string following url=.
This is some Mechanize code for a typical use. Because you gave no sample code to base mine on you'll have to work from this:
agent = Mechanize.new
page = agent.get('http://www.examples.com/')
redirect_url = page.parser.at('meta[http-equiv="refresh"]')['content'][/url=(.+)/, 1]
page = agent.get(redirect_url)
EDIT: at('META[HTTP-EQUIV=\"Refresh\"]')
Your code has the above at(). Notice that you are escaping the double-quotes inside a single-quoted string. That results in a backslash followed by a double-quote in the string which is NOT what my sample uses, and is my first guess for why you're getting the error you are. Nokogiri can't find the tag because there is no <meta http-equiv=\"Refresh\"...>.
EDIT: Mechanize has a built-in way to handle meta-refresh, by setting:
agent.follow_meta_refresh = true
It also has a method to parse the meta tag and return the content. From the docs:
parse(content, uri)
Parses the delay and url from the content attribute of a meta tag. Parse requires the uri of the current page to infer a url when no url is specified. If a block is given, the parsed delay and url will be passed to it for further processing.
Returns nil if the delay and url cannot be parsed.
# <meta http-equiv="refresh" content="5;url=http://example.com/" />
uri = URI.parse('http://current.com/')
Meta.parse("5;url=http://example.com/", uri) # => ['5', 'http://example.com/']
Meta.parse("5;url=", uri) # => ['5', 'http://current.com/']
Meta.parse("5", uri) # => ['5', 'http://current.com/']
Meta.parse("invalid content", uri) # => nil
Mechanize treats meta refresh elements just like links without text. Thus, your code can be as simple as this:
page = agent.get("http://www.euroads.dk/system/index.php?showpage=login")
page.meta_refresh.first.click

Resources