Parsing with Ruby Mechanize - ruby-on-rails

I'm trying to parse a website using the Mechanize gem. So far this is what I have:
page = agent.get("http://www.greatgiftsformen.com/price-range-under-c-131_142.html?page=all")
page.parser.xpath('//tr[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//*[contains(concat( " ", @class, " " ), concat( " ", "productListing-data", " " ))]')[5]
and I get the elements for this product back:
=> #<Nokogiri::XML::Element:0x81c175ec name="td" attributes=[#<Nokogiri::XML::Attr:0x81c17d58 name="valign" value="top">, #<Nokogiri::XML::Attr:0x81c17eac name="align" value="center">, #<Nokogiri::XML::Attr:0x81c17ec0 name="class" value="productListing-data">] children=[#<Nokogiri::XML::Element:0x805fa174 name="a" attributes=[#<Nokogiri::XML::Attr:0x81c13794 name="href" value="http://www.greatgiftsformen.com/gas-pump-retro-liquor-dispenser-p-249.html?osCsid=05f5dbb816874ece6db883c2c48d7ae1">] children=[#<Nokogiri::XML::Element:0x8068e270 name="img" attributes=[#<Nokogiri::XML::Attr:0x81c115ac name="src" value="product_thumb.php?img=images/prod/liquordisp-gas.jpg&w=160&h=160">, #<Nokogiri::XML::Attr:0x81c115c0 name="width" value="160">, #<Nokogiri::XML::Attr:0x81c115d4 name="height" value="160">, #<Nokogiri::XML::Attr:0x81c11714 name="border" value="0">, #<Nokogiri::XML::Attr:0x81c11728 name="alt" value="Gas Pump Retro Liquor Dispenser">, #<Nokogiri::XML::Attr:0x81c11750 name="title" value="Gas Pump Retro Liquor Dispenser">, #<Nokogiri::XML::Attr:0x81c11764 name="class" value="fotgal">]>]>]>
However, when I try to get the href, I get back nil:
url = item.attributes['href']
=> nil

The href belongs to the child <a> element, not to the <td> itself, so I needed to go through the child nodes:
url = item.children[0].attributes['href'].to_s

Related

Nokogiri misses HTML inner text if it contains "<"

I am writing a rake task that converts an HTML string to JSON, using Nokogiri to parse the HTML and build the JSON. Everything was going fine until I noticed that if the inner text looks like
< 109
or
> 109
then Nokogiri returns " 109" instead of "> 109" or "< 109".
If I have a string like
str = '<td>< 109</td>'
then
result = Nokogiri::XML(str)
will return
#(Document:0x115f8 {
name = "document",
children = [ #(Element:0x1160c { name = "td", children = [ #(Text " 109")] })]
})
and
result.children.children.to_s
will return " 109", but I need "< 109".
How can I get the desired result?
I am expecting to get "< 109" instead of just " 109".
You could replace Nokogiri::XML with Nokogiri::HTML, which is more permissive with incorrect syntax:
Nokogiri::XML('<td>< 109</td>').children.last.text # => " 109"
Nokogiri::HTML('<td>< 109</td>').children.last.text # => "< 109"
It's broken HTML. If this is the only issue you are trying to solve, you can fix the HTML before parsing it by replacing each stray < with &lt;.
str = '<td>< 109</td>'
fixed_str = str.gsub(/>< ([0-9]+)</, '>&lt; \1<')
=> "<td>&lt; 109</td>"
result = Nokogiri::XML(fixed_str)
=> #(Document:0x2ac1be2860cc { name = "document", children = [ #(Element:0x2ac1be282940 { name = "td", children = [ #(Text "< 109")] })] })
If there are > chars too:
fixed_str = str.gsub(/>< ([0-9]+)</, '>&lt; \1<').gsub(/>> ([0-9]+)</, '>&gt; \1<')
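The substitution above only covers numbers. A slightly more general pre-processing step might look like the sketch below. It uses only the standard library, and escape_stray_brackets is a hypothetical helper name, not part of Nokogiri. It assumes that a < not followed by a letter, /, ! or ? is literal text, and that a > directly after a tag's closing > and before whitespace is literal text.

```ruby
# Hypothetical helper: escape angle brackets that cannot start or end a tag.
def escape_stray_brackets(html)
  html
    .gsub(/<(?![A-Za-z\/!?])/, '&lt;')  # "<" that does not open a tag, closing tag, comment or PI
    .gsub(/(?<=>)>(?=\s)/, '&gt;')      # ">" right after a tag's ">" and before whitespace
end

less_than    = escape_stray_brackets('<td>< 109</td>')
greater_than = escape_stray_brackets('<td>> 109</td>')

puts less_than     # <td>&lt; 109</td>
puts greater_than  # <td>&gt; 109</td>
```

The escaped string can then be passed to Nokogiri::XML as in the answer above. This is a heuristic and will not catch every malformed input.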

How to replace the last space with nbsp in Ruby

How can I replace the last space with &nbsp; in Ruby?
I have this in the database:
<h1>Hello dear friend!</h1>
<p>How are you?</p>
<figure><img src="..." alt="..." /></figure>
<p>Bye!</p>
And I need this output:
<h1>Hello dear&nbsp;friend!</h1>
<p>How are&nbsp;you?</p>
<figure><img src="..." alt="..." /></figure>
<p>Bye!</p>
I tried to play with Nokogiri:
text = Nokogiri::HTML::DocumentFragment.parse(...)
text.css('h1, h2, h3, h4, h5, h6, p, li').each do |tag|
  tag_arr = tag.content.split(' ')
  # Take the last two words and glue them together with &nbsp;
  tag_last_words = tag_arr[tag_arr.length-2..tag_arr.length]
  tag_return = tag_arr[0..-3].push(tag_last_words.join('&nbsp;'))
  tag_return = tag_return.join(' ')
  tag.content = tag_return
end
but I can't get past some bugs:
all attributes and inner tags (HTML) are deleted
instead of &nbsp; I get &amp;nbsp;
Why? To keep a single word from wrapping onto a new line on mobile devices. (JS is not an option in my case.)
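One way to sidestep both bugs is to leave the markup as a string and substitute only the last space before the closing tag. The sketch below is a stdlib-only, regex-based approach, not a Nokogiri solution. glue_last_word is a hypothetical helper name, and it assumes the last word is plain text directly before the closing tag, not wrapped in an inline element such as <em>.

```ruby
# Hypothetical helper: replace the last space inside h1-h6, p and li
# with &nbsp; so the final word cannot wrap onto a line of its own.
def glue_last_word(html)
  # A space qualifies only when it is followed by one run of non-space,
  # non-tag characters and then the closing tag.
  html.gsub(%r{ (?=[^\s<>]+</(?:h[1-6]|p|li)>)}, '&nbsp;')
end

input = <<~HTML
  <h1>Hello dear friend!</h1>
  <p>How are you?</p>
  <figure><img src="..." alt="..." /></figure>
  <p>Bye!</p>
HTML

output = glue_last_word(input)
puts output
```

Because the string is never re-serialized, attributes and inner tags are left untouched and &nbsp; is not escaped a second time.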

CGI.unescapeHTML generates stray <code> tags

def coderay(text)
  text.gsub(/\<pre( lang="(.+?)")?\>\<code( lang="(.+?)")?\>(.+?)\<\/code\>\<\/pre\>/m) do
    lang = $4
    text = CGI.unescapeHTML($5).gsub /\<code( lang="(.+?)")?\>|\<\/code\>/, ""
    text = text.gsub('<br />', "\n")
    text = text.gsub(/[\<]([\/])*([A-Za-z0-9])*[\>]/, '')
    text = text.gsub('&gt;', ">")
    text = text.gsub('&lt;', "<")
    text = text.gsub('&nbsp;', " ")
    text = text.gsub('&amp;', "&")
    CodeRay.scan(text, lang).div(:css => :class)
  end
end
The above generates this at the last closing "end":
</code(></code(></pre(>
Does anyone know why? I am using the gems CodeRay 1.1.0 and RedCloth 4.2.9, with Ruby 2.1.1, Rails 3.2.19, and RefineryCMS 2.1.3 with its blog engine.
I thought that this line was the cure, but it is not:
text = CGI.unescapeHTML($5).gsub /\<code( lang="(.+?)")?\>|\<\/code\>/, ""
Edited:
This is in the show.html.erb file:
<%= raw (coderay(RedCloth.new(render 'post').to_html)) %>
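To make it easier to see what the numbered captures ($4, $5) in the helper refer to, here is a small stdlib-only sketch of just the extraction and unescaping steps. It is not a fix for the stray tags. The pattern assumes the <pre><code lang="..."> structure described in the question, extract_code_blocks is a hypothetical name, and the CodeRay call is left out so the sketch runs without gems.

```ruby
require 'cgi'

# Non-capturing groups keep the captures down to three:
# the pre lang, the code lang, and the body.
PRE_CODE = %r{<pre(?: lang="(.+?)")?><code(?: lang="(.+?)")?>(.+?)</code></pre>}m

# Hypothetical helper: return the language and unescaped source of each block.
def extract_code_blocks(html)
  html.scan(PRE_CODE).map do |pre_lang, code_lang, body|
    { lang: code_lang || pre_lang, code: CGI.unescapeHTML(body) }
  end
end

html = '<p>intro</p><pre><code lang="ruby">puts 1 &lt; 2 &amp;&amp; true</code></pre>'
blocks = extract_code_blocks(html)
puts blocks.inspect
```

CGI.unescapeHTML already decodes &lt;, &gt; and &amp;, so chained gsub calls for those entities are not needed after it.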

Grails jQuery autocomplete: using the selected ID in a second autocomplete function

I have been following the guide from Alidad's blog to enable jQuery autocomplete within Grails:
http://alidadasb.blogspot.co.uk/2011/12/enabling-jquery-autocomplete-with.html
Country.groovy
package rentspace
class Country {
String name
static hasMany = [cities:City]
}
City.groovy
package rentspace
class City {
//static belongsTo = [country:Country]
static belongsTo = Country
Country country
String name
static constraints = {}
}
GSP Page:
<g:autoComplete id="countrySearch"
action='autocompleteCountry'
controller='any'
domain='rentspace.Country'
searchField='name'
collectField='id'
value=''
/>
<g:textField id="hiddenState" name="hiddenState" value=""/>
<label>City:</label>
<g:autoComplete name="citySearch" id="citySearch"
cid=""
action='autocompleteCityAction'
controller='any'
domain='rentspace.City'
searchField='name'
value=''
/>
AutoCompleteTagLib.groovy
package rentspace
class AutoCompleteTagLib {
..
if (attrs.style) styles = " styles='${attrs.style}'"
if (attrs.cid)
cid="&cid="+attrs.cid
else
cid=""
.......
out << "&order="+attrs.order
out << ""+cid
out << "&collectField="+attrs.collectField
out << "',select: function(event, ui) {"
out << " \$('#hiddenState').val(ui.item.id);},"
//out << " \$('#citySearch').attr('cid',ui.item.id);},"
out <<" search: function() {"
out << "\$('#hiddenState').val('');"
//out << "\$('#citySearch').attr('cid','');"
out <<"}"
out << ", dataType: 'json'"
out << "});"
out << " });"
out << "</script>"
}
def autoCompleteHeader = {
out << "<style>"
out << ".ui-autocomplete-loading"
out << " { background: white url(${resource(dir:'images',file:'ajax-loader.gif')}) right center no-repeat }"
out << " </style>"
}
}
My question concerns passing the value selected in one jQuery autocomplete to another. It works if I write the value to a hidden or text field. What I am trying to do is pass the country id to the second autocomplete box, citySearch.
So once the user autocompletes the country, the country id should be set as cid='1' (or whatever the id is) on the cid attribute of the citySearch autocomplete box.
In the tag lib, the select handler succeeds in updating the value of the hiddenState field, but every attempt to update the cid value with the commented-out line fails:
//out << " \$('#citySearch').attr('cid',ui.item.id);},"
Has anyone succeeded in doing anything like this?
Edited to add:
https://github.com/vahidhedayati/grailscountrycity
The project can be downloaded from the link above; the readme has more information about the issue.
The issue has been solved here by Alidad, by passing { countryid: \$('#hiddenField').val() } as the request data:
https://github.com/alidadasb/CountryCityAutoComplete
out << "\$('#" + attrs.id+"').autocomplete({ "
out << " source: "
out << " function(request, response) { "
out << " \$.getJSON(' "
out << createLink(link)
out << "?"
out << "term=' + request.term + '"
out << "&domain="+ attrs.domain
out << "&searchField="+attrs.searchField
out << "&max="+attrs.max
out << "&order="+attrs.order
out << "&collectField="+attrs.collectField
out << "', { countryid: \$('#hiddenField').val() }, "
out << " response); } "
out << ", dataType: 'json'"
out << "});});"
out << "</script>"
A working version can be seen here:
http://countrycity.cloudfoundry.com

How to search for a particular string in Rails?

I have output like the one below, and I want to search for a particular string and its value. How do I do this in Rails?
For example, I want to search for the Nokogiri::XML::Text pattern and get all matches of Nokogiri::XML::Text.
#<LinkedIn::Location:0x4a339e8 @doc=#<Nokogiri::XML::Document:0x2519cb8 name="document" children=[#<Nokogiri::XML::Element:0x2519a0c name="person" children=[#<Nokogiri::XML::Text:0x25197f0 "\n ">, #<Nokogiri::XML::Element:0x251970c name="`" children=[#<Nokogiri::XML::Text:0x2519124 "\n ">, #<Nokogiri::XML::Element:0x25190f4 name="name" children=[#<Nokogiri::XML::Text:0x2518bd8 "Bengaluru Area, India">]>, #<Nokogiri::XML::Text:0x2518b00 "\n ">, #<Nokogiri::XML::Element:0x2518ad0 name="country" children=[#<Nokogiri::XML::Text:0x2518680 "\n ">, #<Nokogiri::XML::Element:0x2518650 name="code" children=[#<Nokogiri::XML::Text:0x2517fcc "in">]>, #<Nokogiri::XML::Text:0x2517eb8 "\n ">]>, #<Nokogiri::XML::Text:0x2517cd8 "\n ">]>, #<Nokogiri::XML::Text:0x2517bdc "\n">]>]>>
You can get the value like this:
Way 1:
reader = Nokogiri::XML::Reader(xml)
reader.read # moves to the next node in the document
reader.attribute("cdn") # gets the value of an attribute
Way 2:
doc = Nokogiri::XML(xml)
elems = doc.xpath("//*[@messageId]") # gets all elements with a 'messageId' attribute
elems[0].attr('messageId') # gets the attribute value of the first element
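If all you have is the inspect string itself and not the document object, the Nokogiri::XML::Text entries can also be pulled out with a plain regular expression. This is a minimal sketch using only the standard library. The sample string is a shortened copy of the output in the question, and text_node_values is a hypothetical helper name.

```ruby
# Shortened copy of the inspect output from the question (single-quoted,
# so "\n" stays a literal backslash followed by "n", as in the dump).
INSPECT_OUTPUT = '#<Nokogiri::XML::Element:0x25190f4 name="name" children=[' \
                 '#<Nokogiri::XML::Text:0x2518bd8 "Bengaluru Area, India">]>, ' \
                 '#<Nokogiri::XML::Text:0x2518b00 "\n ">'

# Matches one Text entry and captures the quoted value, allowing escaped characters.
TEXT_NODE = /#<Nokogiri::XML::Text:0x\h+ "((?:[^"\\]|\\.)*)">/

# Hypothetical helper: return the quoted value of every Text entry in the dump.
def text_node_values(dump)
  dump.scan(TEXT_NODE).flatten
end

values = text_node_values(INSPECT_OUTPUT)
# Drop entries that hold only whitespace or escaped newlines.
meaningful = values.reject { |v| v.match?(/\A(?:\\n|\s)*\z/) }
puts meaningful.inspect
```

When the document object is available, querying it with XPath as in Way 2 is more robust than matching against inspect output.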
