Nokogiri css selector contains isn't working all the time - ruby-on-rails

I wrote a script to extract the price from HTML:
sign = '$'
doc.css('*:contains("'+sign+'")').each do |element|
# Code that I wrote that extracts the price from the text
end
The problem is, for some sites Nokogiri finds all the elements that contain a dollar sign, and for some sites it doesn't find even one.
Example for site that it doesn't find even one element: http://www.urbanoutfitters.com/urban/catalog/productdetail.jsp?id=39101399&category=W-SOUTH
What am I doing wrong?

Related

Prevent Ruby from changing & to &amp;?

I need to display some superscript and subscript characters in my webpage title. I have a helper method that recognizes the pattern for a subscript or superscript, and converts it to &sub2; or &sup2;.
However, when it shows up in the rendered page's file, it shows up in the source code as:
&amp;sub2;
Which is not right. I have it set up to be:
<% provide(:title, raw(format_title(@hash[:page_title]))) %>
But the raw is not working. Any help is appreciated.
Method:
def format_title(name)
label = name
if label.match /(_[\d]+_)+|(~[\d]+~)+/
label = label.gsub(/(_([\d]+)_)+/, '&sub\2;')
label = label.gsub(/(~([\d]+)~)+/, '&sup\2;')
label.html_safe
else
name
end
end
I have even tried:
str.gsub(/&amp;/, '&')
but it gives back:
&amp;sub2;
You can also achieve this with Rails I18n.
<%= t(:page_title_html, scope: [:title]) %>
And in your respective locale file (most probably title.en.yml):
title:
  page_title_html: "Title with &sup2;"
Here is a chart for HTML symbols regarding subscript and superscripts.
For more information check Preventing HTML character entities in locale files from getting munged by Rails3 xss protection
Update:
In case you need to load the page titles dynamically, first, you'll have to install a gem like Page Title Helper.
You can follow the guide in the gem documentation.
There are two issues with your example: one is substantive and the other is just a coincidence.
The first issue is that you are trying to use character entities that do not actually exist. Specifically, there are only &sup1;, &sup2; and &sup3;, which provide the superscript 1, 2 and 3 symbols respectively. There is no such character entity as &sup4;, nor one for any other superscript digit. There are, however, bare codepoints for the other digits, which you could use, but this would require more involved code.
More importantly, there are no subscript character entities at all in the HTML5 character entities list. All subscript digits are bare codepoints. Therefore it makes no sense to replace anything with &sub2; or any other "subscript" digit entity.
The reason your example did not appear to work is the test string you chose. Anything with underscores, like _2_mystring, is correctly replaced with &sub2;. As the &sub2; character entity does not exist, the string appears as is, creating the impression that the raw method somehow doesn't work.
Try ~2~mystring instead: it is replaced with the superscript character entity &sup2; and renders correctly. This illustrates that your code is correct, but the basic assumption about character entities is not.
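If digits beyond 1, 2 and 3 are needed, the bare codepoints mentioned above can be substituted directly. The following is only a sketch: format_title_unicode is a hypothetical stand-in for format_title, and it assumes the same _N_ and ~N~ markers and a page served as UTF-8. Because it emits literal characters rather than entities, nothing has to survive HTML escaping.

```ruby
# Hypothetical variant of format_title (not from the original question).
# Assumes titles mark subscripts as _N_ and superscripts as ~N~.
SUBSCRIPT_DIGITS   = "₀₁₂₃₄₅₆₇₈₉"
SUPERSCRIPT_DIGITS = "⁰¹²³⁴⁵⁶⁷⁸⁹"

def format_title_unicode(name)
  name
    # Replace each _digits_ run with the matching Unicode subscript digits.
    .gsub(/_(\d+)_/) { Regexp.last_match(1).tr("0-9", SUBSCRIPT_DIGITS) }
    # Replace each ~digits~ run with the matching Unicode superscript digits.
    .gsub(/~(\d+)~/) { Regexp.last_match(1).tr("0-9", SUPERSCRIPT_DIGITS) }
end

puts format_title_unicode("H_2_O and x~10~")
```

With this approach the title can go through the normal escaping path, so neither raw nor html_safe is required.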

How to confirm partial line of text in Ruby

I'm writing a test to confirm that a csv file has hit my downloads folder. As the title of the csv file is set to include the date and time of the download, it's impractical to keep changing the name of the file in my feature. Example filename: fleet_123456_20140707_103015.csv
Can I include in my ruby code, something that will just confirm that the "fleet_123456" is present as it's the only generic part of the name that will appear on every download?
At the moment I have:
Then /^I should get a download with the filename "(.*?)"$/ do |file_name|
page.response_headers['Content-Disposition'].should include("filename=\"#{file_name}\"")
end
I'm thinking that the "#{file_name}\"") needs tweaking, just not sure where.
Any help would be great, thank you
You asked:
Can I include in my ruby code, something that will just confirm that the "fleet_123456" is present as it's the only generic part of the name that will appear on every download?
Yes, you can. One way would be to replace the include matcher with a regex-based one. For example, instead of
page.response_headers['Content-Disposition'].should include("filename=\"#{file_name}\"")
you could write
page.response_headers['Content-Disposition'].should match(/filename="fleet_[\d_]+\.csv"/)
which would match "fleet_" followed by any combination of digits and underscores, followed by ".csv". Another possibility, if you want to be a little more specific, is
page.response_headers['Content-Disposition'].should match(/filename="fleet_123456_\d+_\d+\.csv"/)
which matches the specific arrangement of groups of numbers separated by underscores. You can read about regular expressions in Ruby here and play with them here.
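To keep the step definition reusable, the generic prefix can be passed in and escaped before it goes into the pattern. This is a rough sketch outside of Cucumber: download_named? is a hypothetical helper, and the sample header value is an assumption about what the server sends.

```ruby
# Hypothetical helper: true when the Content-Disposition header names a CSV
# whose filename is the given prefix followed by _<digits>_<digits>.csv.
def download_named?(content_disposition, prefix)
  # Regexp.escape guards against prefixes that contain regex metacharacters.
  pattern = /filename="#{Regexp.escape(prefix)}_\d+_\d+\.csv"/
  content_disposition.match?(pattern)
end

# Assumed header value, based on the example filename in the question.
header = 'attachment; filename="fleet_123456_20140707_103015.csv"'

puts download_named?(header, "fleet_123456")
puts download_named?(header, "fleet_999999")
```

Inside the step, the same pattern could be handed to the match matcher in place of the literal regex.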

Why it is returning an empty array while it has content?

I am trying to get auto-corrected spelling from Google's home page using Nokogiri.
For example, if I am typing "hw did" and the correct spelling is "how did", I have to get the correct spelling.
I tried with the xpath and css methods, but in both cases, I get the same empty array.
I got the XPath and CSS paths using FireBug.
Here is my Nokogiri code:
@requ = params[:search]
@requ_url = @requ.gsub(" ", "+") # encode the URL (a space typed by the user should be converted to +)
@doc = Nokogiri::HTML(open("https://www.google.co.in/search?q=#{@requ_url}"))
binding.pry
Here are my XPath and CSS selectors:
Using XPath:
pry(#<SearchController>)> @doc.xpath("/html/body/div[5]/div[2]/div[6]/div/div[4]/div/div/div[2]/div/p/a").inspect
=> "[]"
Using CSS:
pry(#<SearchController>)> @doc.css('html body#gsr.srp div#main div#cnt.mdm div.mw div#rcnt div.col div#center_col div#taw div div.med p.ssp a.spell').inner_text()
=> ""
First, use the right tools to manipulate URLs; they'll save you headaches.
Here's how I'd find the right spelling:
require 'nokogiri'
require 'uri'
require 'open-uri'
requ = 'hw did'
uri = URI.parse('https://www.google.co.in/search')
uri.query = URI.encode_www_form({'q' => requ})
doc = Nokogiri::HTML(URI.open(uri.to_s)) # URI.open: plain open no longer fetches URLs on Ruby 3.0+
doc.at('a.spell').text # => "how did"
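To see what the URI tools buy you over gsub(" ", "+"), here is a small sketch that only builds the URL and makes no request. search_uri is a hypothetical helper name, and the second query is an invented example of input that the gsub approach would mangle.

```ruby
require 'uri'

# Hypothetical helper: builds the search URL for a query string.
def search_uri(query)
  uri = URI.parse('https://www.google.co.in/search')
  # encode_www_form escapes reserved characters, not just spaces.
  uri.query = URI.encode_www_form('q' => query)
  uri
end

puts search_uri('hw did')
puts search_uri('R&D budget')
```

With plain gsub, the "&" in the second query would be read as a parameter separator and the search term would be cut short.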
It works fine for "how did", but try it with "bnglore" or any other one-word string and it raises an error, the same one I was getting with my previous code: undefined method `text'.
It's not that hard to figure out. They're changing the HTML so you have to change your selector. "Inspect" the suggested word "bangalore" and see where it exists in relation to the previous path. Once you know that, it's easy to find a way to access the word:
doc.at('span.spell').next_element.text # => "bangalore"
Don't trust Google to do things the easy way, or even the best way, or be consistent. Just because they return HTML one way for words with spaces, doesn't mean they're going to do it the same way for a single word. I would do it consistently, but they might be trying to discourage you from mining their pages so don't be surprised if you see variations.
Now, you need to figure out how to write code that knows when to use one selector/method or the other. That's for you to do.

simple formatting/parsing in markdown for blockquotes

I'm using markdown in my site and I would like to do some simple parsing for news articles.
How can I parse markdown to pull all blockquotes and links, so I can highlight them separately from the rest of the document?
For example I would like to parse the first blockquote ( >) in the document so I can push it to the top no matter where it occurs in the document. (Similar to what many news sites do, to highlight certain parts of an article.) but then de-blockquote it for the main body. So it occurs twice (once in the highlighted always at the top and then normally as it occurs in the document).
I will assume you're trying to do this at render-time, when the markdown is going to be converted to HTML. To point you in the right direction, one way you could go about this would be to
Convert the markdown to HTML
Pass the HTML to Nokogiri
Grab the first <blockquote>, copy it, and inject it into the top of the Nokogiri node tree
The result would be a duplicate of the first <blockquote>.
Redcarpet 2 is a great gem for converting Markdown to HTML. Nokogiri is your best bet for HTML parsing.
I can write sample code if necessary, but the documentation for both gems is thorough and this task is trivial enough to just piece together bits from examples within the docs. This at least answers your question of how to go about doing it.
Edit
Depending on the need, this could be done with a line of jQuery too.
$('article').prepend($($('article blockquote').get(0)).clone())
Given the <article> DOM element for an article on your page, grab the first <blockquote>, clone it, and prepend it to the top of the <article>.
I know that wiki markup parsers (e.g. WikiCloth for Ruby) offer the kind of parsing you're after for links, categories, and references. I'm not sure about block quotes, but wiki markup may be better suited.
Something like:
data = "[[ this ]] is a [[ link ]] and another [http://www.google.com Google]. This is a <ref>reference</ref>, but this is a [[Category:Test]]. This is in another [[de:Sprache]]"
wiki = WikiCloth::Parser.new(:data => data)
wiki.to_html
puts "Internal Links: #{wiki.internal_links.size}"
puts "External Links: #{wiki.external_links.size}"
puts "References: #{wiki.references.size}"
puts "Categories: #{wiki.categories.size} [#{wiki.categories.join(",")}]"
puts "Languages: #{wiki.languages.size} [#{wiki.languages.keys.join(",")}]"
I haven't seen any such parsers available for markdown. Using redcarpet, converting to HTML, then using Nokogiri does seem a bit convoluted.
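If the HTML round trip feels too heavy, the first blockquote can also be pulled straight from the Markdown source. This is a different technique from the Redcarpet and Nokogiri route, and only a sketch: extract_first_blockquote is a hypothetical helper, and it assumes blockquotes are plain runs of lines starting with ">" (no nesting, no lazy continuation lines, no ">" inside code blocks).

```ruby
# Hypothetical helper: returns [highlight, body].
# highlight is the text of the first blockquote (or nil if there is none);
# body is the document with that blockquote's ">" markers removed.
def extract_first_blockquote(markdown)
  lines = markdown.lines
  start = lines.index { |line| line.start_with?(">") }
  return [nil, markdown] if start.nil?

  # Count the consecutive ">" lines that make up the first blockquote.
  length = lines.drop(start).take_while { |line| line.start_with?(">") }.size
  quote_lines = lines[start, length].map { |line| line.sub(/\A>\s?/, "") }

  body = (lines[0...start] + quote_lines + lines[(start + length)..-1]).join
  [quote_lines.join.strip, body]
end

doc = "Intro paragraph.\n\n> Key quote here.\n> Second line.\n\nMore text.\n"
highlight, body = extract_first_blockquote(doc)
puts highlight
puts body
```

The highlight can then be rendered at the top of the page and the body passed to the Markdown renderer as usual.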

Find Google Map Line w/ Nokogiri

Using nokogiri I need to search through some HTML for something like:
new GLatLng(-14.468352,132.270434)
and then assign the latitude and longitude values in that code to two variables.
You haven't shown us any example HTML. Nokogiri seems to be the wrong tool for this job if you're just searching for plain text. You could simply do:
require 'open-uri'
html = URI.open('http://stackoverflow.com/questions/6739202/find-google-map-line-w-nokogiri').read # URI.open: plain open no longer fetches URLs on Ruby 3.0+
match = /new GLatLng\((?<lat>.+?),(?<long>.+?)\)/.match html
p match[:lat].to_f
#=> -14.468352
Or, if you need an array of all such matches, say the page also has new GLatLng(17.3,42.1) on it:
matches = html.scan /new GLatLng\((.+?),(.+?)\)/
p matches
#=> [["-14.468352", "132.270434"],["17.3", "42.1"]]
The only reason you might want to use Nokogiri would be to limit your searching to a particular HTML element (e.g. some <script> block).
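Since the question asks for the two values in separate variables, here is a sketch that does so without touching the network. extract_lat_lng is a hypothetical helper, the sample markup is invented for illustration, and the pattern assumes plain decimal coordinates.

```ruby
# Matches "new GLatLng(<lat>,<lng>)" with optional whitespace and signs.
GLATLNG = /new GLatLng\(\s*(?<lat>-?\d+(?:\.\d+)?)\s*,\s*(?<lng>-?\d+(?:\.\d+)?)\s*\)/

# Hypothetical helper: returns [latitude, longitude] as floats, or nil
# when the text contains no GLatLng call.
def extract_lat_lng(html)
  match = GLATLNG.match(html)
  return nil unless match

  [match[:lat].to_f, match[:lng].to_f]
end

# Invented sample markup containing the call from the question.
sample = '<script>map.setCenter(new GLatLng(-14.468352,132.270434), 8);</script>'
latitude, longitude = extract_lat_lng(sample)
puts latitude
puts longitude
```

Matching digits explicitly, rather than .+?, keeps the pattern from capturing stray text when the call's arguments are not literal numbers.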
