Hpricot(html).inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")
hpricot = Hpricot(html)
hpricot.search("script").remove
hpricot.search("link").remove
hpricot.search("meta").remove
hpricot.search("style").remove
Found it at http://www.savedmyday.com/2008/04/25/how-to-extract-text-from-html-using-rubyhpricot/
Nokogiri and Hpricot are pretty much interchangeable here, i.e. Nokogiri(html) is the equivalent of Hpricot(html). I'm not really sure what the linked article is trying to achieve, but its stated goal is to:
Extract text from HTML body which includes ignoring large white spaces between tags and words.
For that, there is an easier approach in Hpricot that removes the need for the hpricot.search("script").remove lines, provided the script and style elements sit in the head rather than the body. Just get the body in the first place:
Hpricot(html).search('body').inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")
And in Nokogiri:
Nokogiri(html).search('body').inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")
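The gsub/split/join chain at the end only collapses whitespace, so that part can be checked without any parser. A minimal sketch in plain Ruby; the helper name squeeze_whitespace and the sample string are made up for illustration:

```ruby
# Hypothetical helper: collapse every run of whitespace into a single space.
def squeeze_whitespace(text)
  # String#split with no argument splits on any run of whitespace
  # (spaces, tabs, \r, \n) and ignores leading/trailing whitespace,
  # so the explicit gsub("\r", " ") and gsub("\n", " ") calls are not needed.
  text.split.join(" ")
end

sample = "  Hello,\r\n   world!\n\n\tBye.  "
puts squeeze_whitespace(sample)
# prints: Hello, world! Bye.
```

The shorter form should behave the same as the longer chain on the text returned by inner_text.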
Related
I am trying to detect the urls from a text and replace them by wrapping in quotes like below:
original text: Hey, it is a url here www.example.com
required text: Hey, it is a url here "www.example.com"
The original text shows my input value and the required text represents the required output. I searched a lot on the web but could not find a solution. I have already tried URI.extract, but that doesn't seem to detect URLs without http or https. Below are examples of some of the URLs I want to deal with. Kindly let me know if you know a solution.
ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94
Find words that look like URLs:
str = "ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.\n\nhttps://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/\n\nwww.jstor.org/stable/24084454\n\nwww.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/\n\ninsu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so\n\nwww.cerege.fr/spip.php?page=pageperso&id_user=94"
str.split.select{|w| w[/(\b+\.\w+)/]}
This will give you an array of the whitespace-separated words that contain a . between word characters, which MIGHT work for your use case.
puts str.split.select{|w| w[/(\b+\.\w+)/]}
www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94
Updated
Complete solution to modify your string:
str_with_quote = str.clone # make a clone for the `gsub!`
str.split.select{|w| w[/(\b+\.\w+)/]}
.each{|url| str_with_quote.gsub!(url, '"' + url + '"')}
Now your cloned object wraps URLs inside double quotes:
puts str_with_quote
This gives the following output. Note that the first URL still swallows ,Les because there is no space after the comma:
ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, "www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les" Belles lettres, 2001.
"https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/"
"www.jstor.org/stable/24084454"
"www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/"
"insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so"
"www.cerege.fr/spip.php?page=pageperso&id_user=94"
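Because the split-based approach works on whole words, anything glued to a URL (such as ,Les above) ends up inside the quotes. One possible alternative is to let gsub match the URL itself. This is only a sketch: the URL_LIKE pattern and the quote_urls helper are assumptions made for illustration, and the pattern deliberately stops at commas, which would truncate a URL that legitimately contains one.

```ruby
# Heuristic (an assumption, not a full URL grammar): an optional http(s)
# scheme, a dotted host name ending in letters, and an optional path that
# stops at whitespace or a comma.
URL_LIKE = %r{(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/[^\s,]*)?}i

# Hypothetical helper: wrap every URL-like match in double quotes.
def quote_urls(text)
  text.gsub(URL_LIKE) { |url| %("#{url}") }
end

puts quote_urls("Hey, it is a url here www.example.com")
# prints: Hey, it is a url here "www.example.com"

puts quote_urls("KIEFFER Jean-Luc, www.jstor.org/stable/24084454,Les Belles lettres, 2001.")
# prints: KIEFFER Jean-Luc, "www.jstor.org/stable/24084454",Les Belles lettres, 2001.
```

Like the split version, this will also quote things that merely look like host names (for example a bare file name such as notes.txt), so it MIGHT need tightening for your data.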
I am trying to find the best way to find a word on a page using Nokogiri.
I have a page which has the following text.
<p>Modelo: ABC123-A</p>
I would like to find the "Modelo:" text, and then get the model number after it.
I have had a look around but can't seem to find anything, so I thought I would post here and see if anyone with experience of Nokogiri could shed some light on this for me.
Use the p:contains selector to get the matching p nodes.
doc = Nokogiri::HTML('<html><body><p>Modelo: ABC123-A</p><br/><p>Nothing here</p><p>Modelo: 4321</p></body></html>')
doc.css('p:contains("Modelo")').map { |x| x.text.split(': ').last }
#=> ["ABC123-A", "4321"]
A simple example:
doc = Nokogiri::HTML('<html><body><p>Modelo: ABC123-A</p></body></html>')
str = doc.css('p').first.content # => "Modelo: ABC123-A"
str.split(': ')[-1] # => "ABC123-A"
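One caveat with split is that a paragraph without the label comes back unchanged. A small sketch in plain Ruby of pulling the value out with a capture group instead; the strings stand in for what node.text or node.content would return:

```ruby
# Text as it would come back from a matching <p> node (taken from the question).
text = "Modelo: ABC123-A"

# String#[] with a regexp and a capture index returns that capture group,
# or nil when the pattern does not match.
model = text[/Modelo:\s*(\S+)/, 1]
puts model
# prints: ABC123-A

# A paragraph without the label yields nil instead of the whole string.
other = "Nothing here"[/Modelo:\s*(\S+)/, 1]
puts other.inspect
# prints: nil
```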
You could also try Oga; it's lighter than Nokogiri.
I am trying to parse og meta tags using the HTTParty gem using this code:
link = "http://www.usatoday.com/story/gameon/2013/01/08/nfl-jets-tony-sparano-fired/1817037/"
# link = "http://news.yahoo.com/chicago-lottery-winners-death-ruled-homicide-181627271.html"
resp = HTTParty.get(link)
ret_body = resp.body
# title
og_title = ret_body.match(/\<[Mm][Ee][Tt][Aa] property\=\"og:title\"\ content\=\"(.*?)\"\/\>/)
og_title = og_title[1].to_s
The problem is that it worked on some sites (Yahoo!) but not on others (USA Today).
Don't parse HTML with regular expressions, because they're too fragile for anything but the simplest problems. A tiny change to the HTML can break the pattern, causing you to begin a slow battle of maintaining an ever expanding pattern. It's a war you won't win.
Instead, use an HTML parser. Ruby has Nokogiri, which is excellent. Here's how I'd do what you want:
require 'nokogiri'
require 'httparty'
%w[
  http://www.usatoday.com/story/gameon/2013/01/08/nfl-jets-tony-sparano-fired/1817037/
  http://news.yahoo.com/chicago-lottery-winners-death-ruled-homicide-181627271.html
].each do |link|
  resp = HTTParty.get(link)
  doc = Nokogiri::HTML(resp.body)
  puts doc.at('meta[property="og:title"]')['content']
end
Which outputs:
Jets fire offensive coordinator Tony Sparano
Chicago lottery winner's death ruled a homicide
Perhaps I can offer an easier solution? Check out the OpenGraph gem.
It's a simple library for parsing Open Graph protocol information from web sites and should solve your problem.
Solution:
og_title = ret_body.match(/\<[Mm][Ee][Tt][Aa] property\=\"og:title\"\ content\=\"(.*?)\"[\s\/\>|\/\>]/)
og_title = og_title[1].to_s
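Trailing whitespace before the closing /> messed up the parsing, so make sure to check for that. I added a bracketed group at the end of the regex to allow for both trailing and non-trailing whitespace. Note that square brackets form a character class, not an OR clause, so the group simply accepts one whitespace character, /, >, or | after the closing quote.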
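If you do stay with a regex, a variant that spells out the optional whitespace and the optional slash may be easier to read. This is a sketch only: the OG_TITLE constant and the sample strings are made up, and it still assumes property comes before content and that both use double quotes, so it remains brittle compared with a parser.

```ruby
# Assumption: attribute order and double quotes are exactly as in the question.
# \s* allows optional whitespace, \/? an optional slash, and /i makes the
# whole pattern case-insensitive (replacing [Mm][Ee][Tt][Aa]).
OG_TITLE = /<meta\s+property="og:title"\s+content="(.*?)"\s*\/?>/i

samples = [
  '<meta property="og:title" content="No trailing space"/>',
  '<meta property="og:title" content="Trailing space" />',
  '<META property="og:title" content="Upper case, no slash">'
]

# String#[] with a capture index returns the first group, or nil on no match.
titles = samples.map { |html| html[OG_TITLE, 1] }
puts titles
```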
Is there a simple way to print an unformatted XML string to the screen in a Ruby on Rails application? Something like an XML beautifier?
REXML::Document, from Ruby's standard library, has pretty printing:
REXML::Document#write( output=$stdout, indent=-1, transitive=false, ie_hack=false )
indent: An integer. If -1, no indenting will be used; otherwise, the indentation will be twice this number of spaces, and children will be indented an additional amount. For a value of 3, every item will be indented 3 more levels, or 6 more spaces (2 * 3). Defaults to -1
An example:
require "rexml/document"
doc = REXML::Document.new "<a><b><c>TExt</c><d /></b><b><d/></b></a>"
out = ""
doc.write(out, 1)
puts out
Produces:
<a>
 <b>
  <c>
   TExt
  </c>
  <d/>
 </b>
 <b>
  <d/>
 </b>
</a>
EDIT: Rails already has REXML loaded, so you only have to create a new document and then write the pretty-printed XML to a string, which can then be embedded in a <pre> tag.
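If the text nodes on their own lines are unwanted, REXML's Pretty formatter can also be used directly with its compact option. A sketch, assuming the rexml library is available (it ships with Ruby as a bundled gem since 3.0); the XML string is the one from the example above:

```ruby
require "rexml/document"
require "rexml/formatters/pretty"

doc = REXML::Document.new("<a><b><c>TExt</c><d /></b><b><d/></b></a>")

formatter = REXML::Formatters::Pretty.new(2) # 2 spaces per nesting level
formatter.compact = true                     # keep short text on its element's line

out = +""                                    # unfrozen string used as the output target
formatter.write(doc, out)
puts out
```

With compact enabled, an element that contains only short text, such as c here, should be written on a single line instead of three.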
What about the Nokogiri gem? Here is an example use.
I'm using Nokogiri to parse a response from the Rackspace API,
so I'm using their sample code to make the request:
response = server.get '/customers/' + @user.customer_id.to_s + '/domains/', server.xml_format
doc = Nokogiri::XML::parse response.body
puts "xpath values"
doc.xpath("//name").each do |node|
  puts node.text
end
That is my code for using Nokogiri to return the node list for the name elements.
For some reason I seem to have missed something obvious, and for the life of me I cannot get it to parse the list of nodes and return them to me. Is there something simple I can do to make it return the list of nodes?
Here's an example of the XML I'm trying to parse:
<domainList xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="urn:xml:domainList">
  <offset>0</offset>
  <size>50</size>
  <total>4</total>
  <domains>
    <domain>
      <name>domain1.com</name>
      <accountNumber>xxxxxxx</accountNumber>
      <serviceType>exchange</serviceType>
    </domain>
    <domain>
      <name>domain2.com</name>
      <accountNumber>xxxxxxx</accountNumber>
      <serviceType>exchange</serviceType>
    </domain>
    <domain>
      <name>domain3.com</name>
      <accountNumber>xxxxxxx</accountNumber>
      <serviceType>exchange</serviceType>
    </domain>
  </domains>
</domainList>
Cheers
The issue seems to be that you have to tell Nokogiri about their namespace.
If you remove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="urn:xml:domainList" from your domainList tag, you'll see your query work.
Otherwise you need to tell Nokogiri about that namespace.
doc.xpath("//blarg:name", {'blarg' => 'urn:xml:domainList'}).each do |name|
  puts name.text
end
Nokogiri's xpath takes a second argument, which is a hash of namespaces. The XML you have defines a default namespace but doesn't give it a prefix. I don't know if there is a way for Nokogiri to just find this, so in your searches just pick an arbitrary prefix and associate the namespace URI with that prefix. You can put whatever text you want instead of blarg; it was just for the example.