Rails: Cannot Parse XML Response using Nokogiri

I'm basically trying to get the Lyric tag from the response I make to the ChartLyrics API. Here is the code I've written:
require 'nokogiri'
require 'open-uri'
request = Net::HTTP.get(URI.parse('http://api.chartlyrics.com/apiv1.asmx/GetLyric?lyricId=1710&lyricCheckSum=a4a56a99ee00cd8e67872a7764d6f9c6'))
puts request
response = Nokogiri::XML(request)
puts response.xpath("//Lyric")[0].to_s
I've read through the documentation but did not find an answer. What am I doing wrong here?

Try the code below:
require 'open-uri'
require 'nokogiri'
xml_doc = Nokogiri::XML(open('http://api.chartlyrics.com/apiv1.asmx/GetLyric?lyricId=1710&lyricCheckSum=a4a56a99ee00cd8e67872a7764d6f9c6'))
# I prefer CSS selectors over XPath
lyrics = xml_doc.css('Lyric')
if lyrics.empty?
  puts "Could not find any lyric in the XML document"
else
  puts lyrics[0].to_s
end

Call response.remove_namespaces! before response.xpath. The ChartLyrics response declares a default XML namespace, so the un-prefixed //Lyric expression will not match until the namespaces are stripped.
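A minimal sketch of that fix, reusing the question's URL:
require 'nokogiri'
require 'net/http'
require 'uri'
xml = Net::HTTP.get(URI.parse('http://api.chartlyrics.com/apiv1.asmx/GetLyric?lyricId=1710&lyricCheckSum=a4a56a99ee00cd8e67872a7764d6f9c6'))
response = Nokogiri::XML(xml)
response.remove_namespaces! # strip the default namespace so //Lyric matches
puts response.xpath('//Lyric')[0].to_s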


How to save pictures from URL to disk

I want to download pictures from a URL like http://trinity.e-stile.ru/ and save the images to a directory such as "C:\pickaxe\pictures". It is important that I use Nokogiri.
I read similar questions on this site, but I could not figure out how it works or what the algorithm should be.
I wrote code that parses the URL and collects the parts of the page source with an "img" tag into a links object:
require 'nokogiri'
require 'open-uri'
PAGE_URL="http://trinity.e-stile.ru/"
page = Nokogiri::HTML(open(PAGE_URL)) # parse the page into a document object
links = page.css("img") # the set of <img> nodes
puts links.length # there are 24 images at this URL
puts
links.each { |i| puts i } # each looks like: <img border="0" alt="" src="/images/kroliku.jpg">
puts
puts
links.each { |link| puts link['src'] } # /images/kroliku.jpg
What method is used to save pictures after grabbing the HTML code?
How can I put the images into a directory on my disk?
I changed the code, but it has an error:
/home/action/.parts/packages/ruby2.1/2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `initialize': getaddrinfo: Name or service not known (SocketError)
This is the code now:
require 'nokogiri'
require 'open-uri'
require 'net/http'
LOCATION = 'pics'
if !File.exist? LOCATION # create the folder if it does not exist
  require 'fileutils'
  FileUtils.mkpath LOCATION
end
#PAGE_URL = "http://ruby.bastardsbook.com/files/hello-webpage.html"
#PAGE_URL="http://trinity.e-stile.ru/"
PAGE_URL="http://www.youtube.com/"
page = Nokogiri::HTML(open(PAGE_URL))
links = page.css("img")
links.each { |link|
  Net::HTTP.start(PAGE_URL) do |http|
    localname = link.gsub /.*\//, '' # keep only the filename
    resp = http.get link['src']
    open("#{LOCATION}/#{localname}", "wb") do |file|
      file.write resp.body
    end
  end
}
You are almost done; the only thing left is to store the files. Note that Net::HTTP.start expects a bare hostname, not a full URL, which is what raised your getaddrinfo error. Let's do it:
LOCATION = 'C:\pickaxe\pictures'
if !File.exist? LOCATION # create the folder if it does not exist
  require 'fileutils'
  FileUtils.mkpath LOCATION
end
require 'net/http'
.... # your code with nokogiri etc.
links.each { |link|
  # pass only the host to Net::HTTP.start, not the full page URL
  Net::HTTP.start(URI.parse(PAGE_URL).host) do |http|
    localname = link['src'].gsub /.*\//, '' # keep only the filename
    resp = http.get link['src']
    open("#{LOCATION}/#{localname}", "wb") do |file|
      file.write resp.body
    end
  end
}
That’s it.
The correct version, using open-uri for the download and URI.join to resolve relative src paths against the page URL:
require 'nokogiri'
require 'open-uri'
LOCATION = 'pics'
if !File.exist? LOCATION # create the folder if it does not exist
  require 'fileutils'
  FileUtils.mkpath LOCATION
end
#PAGE_URL="http://trinity.e-stile.ru/"
PAGE_URL = "http://www.youtube.com/"
page = Nokogiri::HTML(open(PAGE_URL))
links = page.css("img")
links.each { |link|
  uri = URI.join(PAGE_URL, link['src']).to_s # make the URI absolute
  localname = File.basename(link['src'])
  File.open("#{LOCATION}/#{localname}", 'wb') { |f| f.write(open(uri).read) }
}
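URI.join is what makes the relative src values from the question resolve correctly; for example:
require 'uri'
# /images/kroliku.jpg is the relative path shown in the question above
puts URI.join('http://trinity.e-stile.ru/', '/images/kroliku.jpg').to_s
# >> http://trinity.e-stile.ru/images/kroliku.jpg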

XPath.each not working in Rails

My code:
require 'rexml/document'
require 'xpath'
doc = REXML::Document.new(xml)
XPath.each(doc, "*/categoryName") { |element| puts element.text }
I am trying to take the object xml, where xml is a string of XML, and retrieve some element text, i.e. "I want this text".
I thought the code above would work, but it is giving me the following error:
undefined method `each' for XPath:Module
I'm not sure what 'xpath' library you're loading, but you don't want or need it. Confusing the matter is that REXML's documentation assumes that you have 'polluted' your global object via include REXML. Since you are not doing that, you need to provide the full path to the module:
require 'rexml/document'
doc = REXML::Document.new(xml)
REXML::XPath.each(doc, "*/categoryName") { |el| puts el.text }
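Alternatively, if you do include REXML as the documentation assumes, the short form from the question works as-is. A small sketch (the <root> wrapper here is just a stand-in for your real document):
require 'rexml/document'
include REXML # pulls Document, XPath, etc. into the top-level namespace
doc = Document.new('<root><categoryName>I want this text</categoryName></root>')
XPath.each(doc, '*/categoryName') { |element| puts element.text }
# >> I want this text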
I think you missed require 'rexml/xpath'. Try REXML::XPath.each and it will work:
require 'rexml/document'
require 'rexml/xpath'
doc = REXML::Document.new(xml)
REXML::XPath.each(doc, "*/categoryName") { |element| puts element.text }
One example:
require 'rexml/document'
require 'rexml/xpath'
doc = REXML::Document.new("<p>some text <b>this is bold!</b> more text</p>")
REXML::XPath.each(doc, "*//b") { |element| puts element.text }
# >> this is bold!

Nokogiri parsing for metawords

I know this question has been asked before, but I am not able to get the parsed result. I am trying to parse meta keywords using Nokogiri; can anyone point out my mistake?
keyword = []
meta_data = doc.xpath('//meta[@name="Keywords"]/@content') # parsing for keywords
meta_data.each do |meta|
  keyword << meta.value
end
key_str = keyword.join(",")
I tried running this in irb as well but keyword returns a nil.
This is how I used it in irb
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML("www.google.com")
I have already tried alternatives from other Stack Overflow posts, like Nokogiri html parsing question, but to no avail; they still return nil. I guess I am doing something wrong somewhere.
www.google.com does not have any meta keywords in the source. View Source on the page to see for yourself. So even if everything else went perfectly, you'd still get no results there.
The result of doc = Nokogiri::HTML("www.google.com") is
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>www.google.com</p></body></html>
If you want to fetch the contents of a URL, you want to use something like:
require 'open-uri'
doc = Nokogiri::HTML( open('http://www.google.com' ) )
If you get a valid HTML page, and use the proper casing on keywords to match the source, it works fine. Here's an example from my IRB session, fetching a page from one of the apps on my site that happens to use name="keywords" instead of name="Keywords":
irb(main):001:0> require 'open-uri'
#=> true
irb(main):002:0> require 'nokogiri'
#=> true
irb(main):003:0> url = "http://pentagonalrobin.phrogz.net/choose"
#=> "http://pentagonalrobin.phrogz.net/choose"
irb(main):004:0> doc = Nokogiri::HTML( open(url) ); nil # don't show doc here
#=> nil
irb(main):005:0> doc.xpath('//meta[@name="keywords"]/@content').map(&:value)
#=> ["team schedule free round-robin league"]

How to find the href attribute value of an "<a>" tag with Ruby

My goal is to find the first result in Google's search results and collect its link, so I built this script:
require 'hpricot'
require 'open-uri'
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
search_results = search_results.body
doc = Hpricot(search_results)
site = doc.search("a")[16,1]
url = site.to_s
puts url
I get a string like this:
url = <a href="http://en.wikipedia.org/wiki/Gallon"><em>Gallon</em> - Wikipedia, the free encyclopedia</a>
But I need only the link (http://en.wikipedia.org/wiki/Gallon) not all the html code...
How can I do it? I am using the gems:
require 'hpricot'
require 'open-uri'
require 'mechanize'
You can get the value of attributes like this
(doc/"a")[16].attributes['href']
but I have to say that the magic number 16 seems brittle.
You are also not supposed to scrape Google's search results; consider using the Custom Search API instead.
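If you are working with Nokogiri instead (Mechanize uses it under the hood), the attribute lookup is just as direct; a small sketch using the markup from above:
require 'nokogiri'
html = '<a href="http://en.wikipedia.org/wiki/Gallon"><em>Gallon</em> - Wikipedia, the free encyclopedia</a>'
doc = Nokogiri::HTML(html)
puts doc.css('a')[0]['href'] # Nokogiri nodes allow hash-style attribute access
# >> http://en.wikipedia.org/wiki/Gallon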
Since Mechanize includes Nokogiri, you should skip Hpricot altogether; it slows your code down unnecessarily, and you are effectively doing the same thing twice.
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
puts search_results.links[16].href
Instead of converting to a string with url = site.to_s, do url = site[0].attributes['href'].
Try using:
site = doc.search("a[@href]")[16,1]
Watir is a reasonable choice to check the layout of a web page.
require 'rubygems'
require 'watir'
# Launch a browser window and navigate to Google
browser = Watir::Browser.new
browser.goto("http://www.google.co.il/")
# Log to the console whether a link with href = http://en.wikipedia.org/wiki/Gallon is present
puts browser.link(:href, "http://en.wikipedia.org/wiki/Gallon").exists?
Since the input is always going to follow the same format, you could just do:
url.split("href=\"").last.split("\"").first
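For the string shown earlier, that splitting works like this:
url = '<a href="http://en.wikipedia.org/wiki/Gallon"><em>Gallon</em> - Wikipedia, the free encyclopedia</a>'
puts url.split("href=\"").last.split("\"").first
# >> http://en.wikipedia.org/wiki/Gallon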

FasterCSV: Read Remote CSV Files

I can't seem to get this to work. I want to pull a CSV file from a different webserver to read in my application. This is how I'd like to call it:
url = 'http://www.testing.com/test.csv'
records = FasterCSV.read(url, :headers => true, :header_converters => :symbol)
But that doesn't work. I tried Googling, and all I came up with was this excerpt: Practical Ruby Gems
So, I tried modifying it as follows:
require 'open-uri'
url = 'http://www.testing.com/test.csv'
csv_url = open(url)
records = FasterCSV.read(csv_url, :headers => true, :header_converters => :symbol)
... and I get a can't convert Tempfile into String error (coming from the FasterCSV gem).
Can anyone tell me how to make this work?
require 'open-uri'
url = 'http://www.testing.com/test.csv'
open(url) do |f|
  f.each_line do |line|
    FasterCSV.parse(line) do |row|
      # Your code here
    end
  end
end
http://www.ruby-doc.org/core/classes/OpenURI.html
http://fastercsv.rubyforge.org/
I would retrieve the file with Net::HTTP, for example, and feed that to FasterCSV.
Extracted from ri Net::HTTP
require 'net/http'
require 'uri'
url = URI.parse('http://www.example.com/index.html')
res = Net::HTTP.start(url.host, url.port) { |http|
  http.get('/index.html')
}
puts res.body
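Feeding the response body to FasterCSV then looks like this; a sketch, assuming the test.csv URL from the question:
require 'net/http'
require 'uri'
require 'fastercsv'
url = URI.parse('http://www.testing.com/test.csv')
res = Net::HTTP.start(url.host, url.port) { |http| http.get(url.path) }
# FasterCSV.parse accepts the body as a plain string
records = FasterCSV.parse(res.body, :headers => true, :header_converters => :symbol)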
You just made a small mix-up: you should have used FasterCSV.parse instead of FasterCSV.read. FasterCSV.read expects a file path to open, while FasterCSV.parse accepts a string or an IO object like the one open-uri returns:
data = open('http://www.testing.com/test.csv')
records = FasterCSV.parse(data)
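To keep the original question's options, pass them through to parse unchanged; a sketch (:some_column stands in for whatever your real headers produce):
require 'open-uri'
require 'fastercsv'
data = open('http://www.testing.com/test.csv')
records = FasterCSV.parse(data, :headers => true, :header_converters => :symbol)
records.each { |row| puts row[:some_column] } # :some_column is a hypothetical header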
I would download it with rio - as easy as:
require 'rio'
require 'fastercsv'
array_of_arrays = FasterCSV.parse(rio('http://www.example.com/index.html').read)
I upload the CSV file with Paperclip, save it to Cloud Files, and then start processing the file with Delayed::Job.
This worked for me:
require 'open-uri'
url = 'http://www.testing.com/test.csv'
open(url) do |file|
  FasterCSV.parse(file.read) do |row|
    # Your code here
  end
end
