Construct URLs after scraping for image paths - ruby-on-rails

I'm trying to scrape a web URL input by the user and then output an array of valid, non-broken image elements with absolute paths in HTML. I'm using Nokogiri for the scraping, and I want to know if there is anything I can use to easily process the unpredictable user-provided URLs and scraped image paths, short of writing something from scratch.
Examples:
http://domain.com/ and /system/images/image.png
=> http://domain.com/system/images/image.png
http://sub.domain.com and images/common/image.png
=> http://sub.domain.com/images/common/image.png
http://domain.com/dir/ and images/image.png
=> http://domain.com/dir/images/image.png
http://domain.com/dir and /images/small/image.png
=> http://domain.com/images/small/image.png
http://domain.com and http://s3.amazon-aws.com/bucket/image.png
=> http://s3.amazon-aws.com/bucket/image.png

Instead of downloading the pages and using Nokogiri, I would recommend using Mechanize. It is built on top of Nokogiri, so everything you can do with Nokogiri you can do with Mechanize, but it adds a lot of useful functionality for scraping/navigating. It will take care of the relative URL problem you describe above.
require 'rubygems'
require 'mechanize'
url='http://stackoverflow.com/questions/5903218/construct-urls-after-scraping-for-image-paths/5903417'
Mechanize.new.get(url) {|page| puts page.image_urls.join "\n"}

If you really want to do it yourself (instead of using Mechanize, say), use URI::join:
require 'uri'
URI::join("http://domain.com/dir", "/images/small/image.png").to_s
# => "http://domain.com/images/small/image.png"
Note that you have to respect the HTML page's BASE tag if there is one...

Related

How to extract only highlight data from pdf file in rails

I want to extract all the highlighted text from a PDF in Rails. Does anyone have any idea? I am not able to figure it out.
You can either use the pdf-reader gem (the example/text.rb example is simple and worked for me): https://github.com/yob/pdf-reader
Or the command-line utility pdftotext.
# Extract all text from a single PDF
require 'rubygems'
require 'pdf/reader'

filename = File.expand_path(File.dirname(__FILE__)) + "/../spec/data/cairo-unicode.pdf"

PDF::Reader.open(filename) do |reader|
  reader.pages.each do |page|
    puts page.text
  end
end
This is basic text extraction using the gem mentioned above, and it should give you a good head start: grab all the text from the document, then work out how to pull the specific sections you need from the data you get back.
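Building on that, a rough sketch of the "grab specific sections" step, done in plain Ruby once you have each page's text (the keyword pattern and sample text are my own placeholders; actual highlight data lives in PDF annotations, so plain text matching may not be enough on its own):

```ruby
# Filter extracted page text: returns [page_number, line] pairs matching a
# pattern. pages_text is an array of strings, one per page, as produced by
# reader.pages.map(&:text) with the pdf-reader gem shown above.
def select_lines(pages_text, pattern)
  pages_text.flat_map.with_index(1) do |text, page_no|
    text.each_line
        .select { |line| line.match?(pattern) }
        .map    { |line| [page_no, line.strip] }
  end
end

pages = ["Intro\nThis sentence is highlighted\n", "Nothing here\n"]
select_lines(pages, /highlighted/)
# => [[1, "This sentence is highlighted"]]
```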

Nokogiri gem vs. opening by hand

I can't get Nokogiri to return the same thing I see when I go to a page and "View Source". And for the life of me can't figure out why.
This is the page I am looking at:
http://www.amazon.com/gp/product/B009NWFP5Q
And as you can see, it returns a shoe that's orange. If I view the source and find the link I'm looking for by searching for "hiRes" twice, I get:
http://ecx.images-amazon.com/images/I/71b75uTtzDL.UL1500.jpg
However, if I run this code with Nokogiri:
require 'nokogiri'
require 'open-uri'
require 'uri'
url = "http://www.amazon.com/gp/product/B009NWFP5Q"
doc = Nokogiri::HTML(open(url))
pic = doc.css('div#imageBlock_feature_div script')[0]
puts pic
and look for the link in the same position I get this image:
http://ecx.images-amazon.com/images/I/81R97WG9nyL.UL1500.jpg
which is a BLUE shoe!!! Arghhh..
Any idea why??
Maybe the color being shown is somehow based on your session, or on dynamic attributes assigned to a cookie stored in your browser. Find a way to provide a URL that returns exactly what you are looking for. It may also be possible to provide a cookie using the HTTP client code, but that seems like a plan B.
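If you do end up needing that plan B, open-uri accepts a hash of request headers, so you can send a fixed User-Agent plus a cookie copied from your browser session. A minimal sketch (the User-Agent string and cookie value below are placeholder assumptions, not from the question):

```ruby
require 'open-uri'

# Build the request headers open-uri will send along with the GET.
def scrape_headers(cookie)
  {
    'User-Agent' => 'Mozilla/5.0 (compatible; MyScraper/1.0)',
    'Cookie'     => cookie
  }
end

# html = URI.open('http://www.amazon.com/gp/product/B009NWFP5Q',
#                 scrape_headers('session-id=PASTE-YOURS-HERE')).read
```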

Why it is returning an empty array while it has content?

I am trying to get auto-corrected spelling from Google's home page using Nokogiri.
For example, if I am typing "hw did" and the correct spelling is "how did", I have to get the correct spelling.
I tried with the xpath and css methods, but in both cases, I get the same empty array.
I got the XPath and CSS paths using FireBug.
Here is my Nokogiri code:
@requ = params[:search]
@requ_url = @requ.gsub(" ", "+") # encode the URL (if the user inputs a space it should be converted into +)
@doc = Nokogiri::HTML(open("https://www.google.co.in/search?q=#{@requ_url}"))
binding.pry
Here are my XPath and CSS selectors:
Using XPath:
pry(#<SearchController>)> @doc.xpath("/html/body/div[5]/div[2]/div[6]/div/div[4]/div/div/div[2]/div/p/a").inspect
=> "[]"
Using CSS:
pry(#<SearchController>)> @doc.css('html body#gsr.srp div#main div#cnt.mdm div.mw div#rcnt div.col div#center_col div#taw div div.med p.ssp a.spell').inner_text()
=> ""
First, use the right tools to manipulate URLs; they'll save you headaches.
Here's how I'd find the right spelling:
require 'nokogiri'
require 'uri'
require 'open-uri'
requ = 'hw did'
uri = URI.parse('https://www.google.co.in/search')
uri.query = URI.encode_www_form({'q' => requ})
doc = Nokogiri::HTML(open(uri.to_s))
doc.at('a.spell').text # => "how did"
It works fine with "how did", but check it with "bnglore" or any one-word string; it gives an error. The same thing was happening in my previous code: it shows undefined method `text'.
It's not that hard to figure out. They're changing the HTML, so you have to change your selector. "Inspect" the suggested word "bangalore" and see where it sits relative to the previous path. Once you know that, it's easy to find a way to access the word:
doc.at('span.spell').next_element.text # => "bangalore"
Don't trust Google to do things the easy way, or even the best way, or be consistent. Just because they return HTML one way for words with spaces, doesn't mean they're going to do it the same way for a single word. I would do it consistently, but they might be trying to discourage you from mining their pages so don't be surprised if you see variations.
Now, you need to figure out how to write code that knows when to use one selector/method or the other. That's for you to do.

Nokogiri- Parsing HTML <a href> and displaying only part of the URL

So basically I am scraping a website, and I want to display only part of an address. For instance, if it is www.yadaya.com/nyc/sales/manhattan, I only want to put "sales" in a hash or an array.
{
  :listing_class => listings.css('a').text
}
That will give me the whole URL. Would I want to gsub to get the partial output?
Thanks!
When you are dealing with URLs, you should start with URI, then, to mess with the path, switch to using File.dirname and/or File.basename:
require 'uri'
uri = URI.parse('http://www.yadaya.com/nyc/sales/manhattan')
dir = File.dirname(uri.path).split('/').last
which sets dir to "sales".
No regex is needed, except what parse and split do internally.
Using that in your code's context:
File.dirname(URI.parse(listings.css('a').text).path).split('/').last
but, personally, I'd break that into two lines for clarity and readability, which translate into easier maintenance.
A warning though:
listings.css('a')
returns a NodeSet, which is akin to an Array. If the DOM you are searching has multiple <a> tags, you will get more than one node passed to text, and their contents will be concatenated into the string you are treating as a URL. That's a bug waiting to happen:
require 'nokogiri'
html = '<div><a>foo</a><a>bar</a></div>'
doc = Nokogiri::HTML(html)
doc.at('div').css('a').text
Which results in:
"foobar"
Instead, your code needs to be:
listings.at('a')
or
listings.at_css('a')
so only one node is returned. In the context of my sample code:
doc.at('div').at('a').text
# => "foo"
Even if the code that sets up listings only results in a single <a> node being visible, use at or at_css for correctness.
Since you have the full URL using listings.css('a').text, you could parse out a section of the path using a combination of the URI class and a regular expression, using something like the following:
require 'uri'
uri = URI.parse(listings.css('a').text)
=> #<URI::HTTP:0x007f91a39255b8 URL:http://www.yadaya.com/nyc/sales/manhattan>
match = %r{^/nyc/([^/]+)/}.match(uri.path)
=> #<MatchData "/nyc/sales/" 1:"sales">
match[1]
=> "sales"
You may need to tweak the regular expression to meet your needs, but that's the gist of it.
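For example, if you'd rather not hard-code the /nyc/ prefix, one variant (the helper name and segment index are my own assumptions, based on the example URL) grabs a path segment by position instead:

```ruby
require 'uri'

# Return the path segment at the given index (0-based, ignoring the
# leading empty segment produced by the "/" at the start of the path).
def path_segment(url, index)
  URI.parse(url).path.split('/').reject(&:empty?)[index]
end

path_segment('http://www.yadaya.com/nyc/sales/manhattan', 1)
# => "sales"
```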

using 'puts' to get information from external domain

I've just started with Ruby on Rails the other day, and I was wondering: is it possible to use the puts function to get the content of a div from a page on an external domain?
Something like puts "http://www.example.com #about".
Would something like this work? Or would you have to get the entire page and then puts the section that you wanted?
Additionally, if the content in the example.com #about div is constantly changing, would puts constantly update its output, or would it only run the script each time the page is refreshed?
The open-uri library (for fetching the page) and the Nokogiri gem (for parsing and retrieving specific content) can assist with this.
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open('http://www.example.com/'))
puts doc.at('#about').text
puts will not work that way. Ruby makes parsing HTML fairly easy, though. Take a look at the Nokogiri library; you can use XPath queries to get to the div you want to print out. I believe you would need to reopen the page if the div changes, but I'm not positive about that; you can easily test it (or someone here can confirm or reject that statement).
