XPath.each not working in Rails

My code:
require 'rexml/document'
require 'xpath'
doc = REXML::Document.new(xml)
XPath.each(doc, "*/categoryName") { |element| puts element.text }
I am trying to take an object xml, where xml is a string of XML, and retrieve some text from it, i.e. the contents of the categoryName elements ("I want this text").
I thought the code above would work, but it is giving me the following error:
undefined method `each' for XPath:Module

I'm not sure what 'xpath' library you're loading, but you don't want or need it. Confusing the matter is that REXML's documentation assumes that you have 'polluted' your global object via include REXML. Since you are not doing that, you need to provide the full path to the module:
require 'rexml/document'
doc = REXML::Document.new(xml)
REXML::XPath.each(doc, "*/categoryName") { |el| puts el.text }
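Alternatively, if you want the shorter names the documentation uses, you can pollute the global namespace yourself. A minimal sketch of that trade-off:
require 'rexml/document'
include REXML # now Document and XPath resolve without the REXML:: prefix

doc = Document.new(xml)
XPath.each(doc, "*/categoryName") { |element| puts element.text }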

I think you missed require 'rexml/xpath'. Try REXML::XPath.each; it will work.
require 'rexml/document'
require 'rexml/xpath'
doc = REXML::Document.new(xml)
REXML::XPath.each(doc, "*/categoryName") { |element| puts element.text }
One example:
require 'rexml/document'
require 'rexml/xpath'
doc = REXML::Document.new("<p>some text <b>this is bold!</b> more text</p>")
REXML::XPath.each(doc, "*//b") { |element| puts element.text }
# >> this is bold!

Related

My scraped data is empty (Rails and Mechanize)

I am writing a simple script to scrape data from this link: https://www.congress.gov/members.
The script goes through each member link, follows it, and scrapes data from that page. The script is a .rake file in a Ruby on Rails application.
Below is the script:
require 'mechanize'
require 'date'
require 'json'
require 'openssl'

# Disable SSL peer verification (insecure; only acceptable for local testing)
module OpenSSL
  module SSL
    remove_const :VERIFY_PEER
  end
end
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE
I_KNOW_THAT_OPENSSL_VERIFY_PEER_EQUALS_VERIFY_NONE_IS_WRONG = nil

task :testing do
  agent = Mechanize.new
  page = agent.get("https://www.congress.gov/members")
  page_links = page.links_with(href: %r{^/member/\w+})
  product_links = page_links[0...2]
  products = product_links.map do |link|
    product = link.click
    state = product.search('td:nth-child(1)').text
    website = product.search('.member_website+ td').text
    {
      state: state,
      website: website
    }
  end
  puts JSON.pretty_generate(products)
end
and when I ran this script/file, the scraped data came back empty.
Your regular expression does not match links.
Try this: page_links = page.links_with(href: %r{.*/member/\w+})
You can validate regular expressions here: http://rubular.com/
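For example, a minimal check of both patterns (hedged; it assumes, as the failure suggests, that the member hrefs on the page are absolute URLs rather than paths beginning with /member/):
require 'mechanize'

agent = Mechanize.new
page = agent.get("https://www.congress.gov/members")
# ^ anchors at the start of the href string, so an absolute URL never matches
puts page.links_with(href: %r{^/member/\w+}).size
# .* lets any prefix through, so absolute and relative hrefs both match
puts page.links_with(href: %r{.*/member/\w+}).size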

How to save pictures from URL to disk

I want to download pictures from a URL like http://trinity.e-stile.ru/ and save the images to a directory like "C:\pickaxe\pictures". It is important that I use Nokogiri.
I read similar questions on this site, but I couldn't figure out how it works or understand the algorithm.
I wrote code that parses the URL and collects the parts of the page source with an "img" tag into a links object:
require 'nokogiri'
require 'open-uri'

PAGE_URL = "http://trinity.e-stile.ru/"
page = Nokogiri::HTML(open(PAGE_URL)) # parse the page into an object
links = page.css("img") # node set of the img tags
puts links.length # there are 24 images at this URL
puts
links.each { |i| puts i } # each looks like: <img border="0" alt="" src="/images/kroliku.jpg">
puts
puts
links.each { |link| puts link['src'] } # /images/kroliku.jpg
What method is used to save pictures after grabbing the HTML code?
How can I put the images into a directory on my disk?
I changed the code, but it has an error:
/home/action/.parts/packages/ruby2.1/2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `initialize': getaddrinfo: Name or service not known (SocketError)
This is the code now:
require 'nokogiri'
require 'open-uri'
require 'net/http'

LOCATION = 'pics'
if !File.exist? LOCATION # create the folder if it does not exist
  require 'fileutils'
  FileUtils.mkpath LOCATION
end

#PAGE_URL = "http://ruby.bastardsbook.com/files/hello-webpage.html"
#PAGE_URL = "http://trinity.e-stile.ru/"
PAGE_URL = "http://www.youtube.com/"

page = Nokogiri::HTML(open(PAGE_URL))
links = page.css("img")
links.each { |link|
  Net::HTTP.start(PAGE_URL) do |http| # this is the line that raises the SocketError
    localname = link.gsub /.*\//, '' # keep the filename only
    resp = http.get link['src']
    open("#{LOCATION}/#{localname}", "wb") do |file|
      file.write resp.body
    end
  end
}
You are almost done. The only thing left is to store files. Let’s do it.
LOCATION = 'C:\pickaxe\pictures'
if !File.exist? LOCATION # create the folder if it does not exist
  require 'fileutils'
  FileUtils.mkpath LOCATION
end
require 'net/http'

.... # your code with nokogiri etc.

links.each { |link|
  Net::HTTP.start(PAGE_URL) do |http|
    localname = link.gsub /.*\//, '' # keep the filename only
    resp = http.get link['src']
    open("#{LOCATION}/#{localname}", "wb") do |file|
      file.write resp.body
    end
  end
}
That’s it.
The correct version (Net::HTTP.start expects a bare hostname, not a full URL, which is what caused the getaddrinfo SocketError above):
require 'nokogiri'
require 'open-uri'

LOCATION = 'pics'
if !File.exist? LOCATION # create the folder if it does not exist
  require 'fileutils'
  FileUtils.mkpath LOCATION
end

#PAGE_URL = "http://trinity.e-stile.ru/"
PAGE_URL = "http://www.youtube.com/"

page = Nokogiri::HTML(open(PAGE_URL))
links = page.css("img")
links.each { |link|
  uri = URI.join(PAGE_URL, link['src']).to_s # make the URI absolute
  localname = File.basename(link['src'])
  File.open("#{LOCATION}/#{localname}", 'wb') { |f| f.write(open(uri).read) }
}
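If you would rather keep Net::HTTP, the SocketError goes away once you pass it a hostname instead of the full page URL. A minimal sketch, assuming every src value is a path on the same host:
require 'net/http'
require 'uri'

uri = URI.parse(PAGE_URL)
Net::HTTP.start(uri.host, uri.port) do |http|
  links.each do |link|
    localname = File.basename(link['src'])
    resp = http.get(link['src']) # works for host-relative src paths
    File.open("#{LOCATION}/#{localname}", 'wb') { |f| f.write(resp.body) }
  end
end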

Rails: Cannot Parse XML Response using Nokogiri

I'm basically trying to get the Lyric tag out of the response I get from the ChartLyrics API. Here is the code I've written:
require 'nokogiri'
require 'open-uri'
require 'net/http' # needed for Net::HTTP below

request = Net::HTTP.get(URI.parse('http://api.chartlyrics.com/apiv1.asmx/GetLyric?lyricId=1710&lyricCheckSum=a4a56a99ee00cd8e67872a7764d6f9c6'))
puts request
response = Nokogiri::XML(request)
puts response.xpath("//Lyric")[0].to_s
I've read the documentation but did not find an answer. What am I doing wrong here?
Try the code below:
require 'open-uri'
require 'nokogiri'

xml_doc = Nokogiri::XML(open('http://api.chartlyrics.com/apiv1.asmx/GetLyric?lyricId=1710&lyricCheckSum=a4a56a99ee00cd8e67872a7764d6f9c6'))
# I always prefer css over xpath
lyrics = xml_doc.css('Lyric')
if lyrics.empty?
  puts "Could not find any lyric in the XML document"
else
  puts lyrics[0].to_s
end
Do response.remove_namespaces! before response.xpath. The API wraps the document in a default namespace, so a plain //Lyric query matches nothing until the namespaces are stripped.
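Applied to the question's code, that looks something like this (a sketch of the same fix):
require 'nokogiri'
require 'net/http'

request = Net::HTTP.get(URI.parse('http://api.chartlyrics.com/apiv1.asmx/GetLyric?lyricId=1710&lyricCheckSum=a4a56a99ee00cd8e67872a7764d6f9c6'))
response = Nokogiri::XML(request)
response.remove_namespaces! # without this, //Lyric silently returns an empty node set
puts response.xpath("//Lyric")[0].to_s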

How to grep file names and extensions in a webpage using nokogiri/hpricot and other gems?

I am working on an application where I have to
1) get all the links of a website, and
2) get the list of all the files and file extensions in each of the web pages/links.
I am done with the first part of it :)
I get all the links of the website with the code below:
require 'rubygems'
require 'spidr'
require 'uri'

Spidr.site('http://testasp.vulnweb.com/') do |spider|
  spider.every_url { |url| puts url }
end
Now I have to get all the files/file extensions in each of the pages, so I tried the code below:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'spidr'

site = 'http://testasp.vulnweb.com'
in1 = []
Spidr.site(site) do |spider|
  spider.every_url { |url| in1.push url }
end

in1.each do |input1|
  input1 = input1.to_s
  #puts input1
  begin
    doc = Nokogiri::HTML(open(input1))
    doc.traverse do |el|
      [el[:src], el[:href]].grep(/\.(txt|css|gif|jpg|png|pdf)$/i).map { |l| URI.join(input1, l).to_s }.each do |link|
        puts link
      end
    end
  rescue => e
    puts "errrooooooooor"
  end
end
But can anybody guide me on how to parse the links/webpage and get the file extensions in the page?
You might want to take a look at URI.parse. The URI module is part of the Ruby standard library and is a dependency of the spidr gem. Here's an example implementation, with a spec for good measure:
require 'rspec'
require 'uri'

class ExtensionExtractor
  def extract(uri)
    /\A.*\/(?<file>.*\.(?<extension>txt|css|gif|jpg|png|pdf))\z/i =~ URI.parse(uri).path
    { :path => uri, :file => file, :extension => extension }
  end
end

describe ExtensionExtractor do
  before(:all) do
    @css_uri = "http://testasp.vulnweb.com/styles.css"
    @gif_uri = "http://testasp.vulnweb.com/Images/logo.gif"
    @gif_uri_with_param = "http://testasp.vulnweb.com/Images/logo.gif?size=350x350"
  end

  describe "Common Extensions" do
    it "should extract CSS files from URIs" do
      file = subject.extract(@css_uri)
      file[:path].should eq @css_uri
      file[:file].should eq "styles.css"
      file[:extension].should eq "css"
    end

    it "should extract GIF files from URIs" do
      file = subject.extract(@gif_uri)
      file[:path].should eq @gif_uri
      file[:file].should eq "logo.gif"
      file[:extension].should eq "gif"
    end

    it "should properly extract extensions even when URIs have parameters" do
      file = subject.extract(@gif_uri_with_param)
      file[:path].should eq @gif_uri_with_param
      file[:file].should eq "logo.gif"
      file[:extension].should eq "gif"
    end
  end
end
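Outside the spec, the extractor drops straight into the Spidr loop from the question (hypothetical wiring, reusing the ExtensionExtractor class above):
extractor = ExtensionExtractor.new
Spidr.site('http://testasp.vulnweb.com/') do |spider|
  spider.every_url do |url|
    result = extractor.extract(url.to_s)
    # the named captures come back nil when the URL has no matching extension
    puts "#{result[:file]} (#{result[:extension]})" if result[:file]
  end
end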

Nokogiri parsing for metawords

I know this question has been asked earlier, but I am not able to get the parsed result. I am trying to parse metawords using Nokogiri; can anyone point out my mistake?
keyword = []
meta_data = doc.xpath('//meta[@name="Keywords"]/@content') # parsing for keywords
meta_data.each do |meta|
  keyword << meta.value
end
key_str = keyword.join(",")
I tried running this in irb as well, but keyword returns nil.
This is how I used it in irb
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML("www.google.com")
I have already tried alternatives from other Stack Overflow posts, like Nokogiri html parsing question, but to no avail; they still return nil. I guess I am doing something wrong somewhere.
www.google.com does not have any meta keywords in the source. View Source on the page to see for yourself. So even if everything else went perfectly, you'd still get no results there.
The result of doc = Nokogiri::HTML("www.google.com") is
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>www.google.com</p></body></html>
If you want to fetch the contents of a URL, you want to use something like:
require 'open-uri'
doc = Nokogiri::HTML( open('http://www.google.com' ) )
If you get a valid HTML page, and use the proper casing on keywords to match the source, it works fine. Here's an example from my IRB session, fetching a page from one of the apps on my site that happens to use name="keywords" instead of name="Keywords":
irb(main):001:0> require 'open-uri'
#=> true
irb(main):002:0> require 'nokogiri'
#=> true
irb(main):003:0> url = "http://pentagonalrobin.phrogz.net/choose"
#=> "http://pentagonalrobin.phrogz.net/choose"
irb(main):004:0> doc = Nokogiri::HTML( open(url) ); nil # don't show doc here
#=> nil
irb(main):005:0> doc.xpath('//meta[#name="keywords"]/#content').map(&:value)
#=> ["team schedule free round-robin league"]
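Tying it back to the question's key_str, the same pipeline condensed (it assumes the target page actually carries a keywords meta tag, as this one does):
require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open('http://pentagonalrobin.phrogz.net/choose'))
key_str = doc.xpath('//meta[@name="keywords"]/@content').map(&:value).join(",")
puts key_str #=> "team schedule free round-robin league"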
