RegEx Not working in Ruby! - ruby-on-rails

I am using the following regex
html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/s))
to match the name [ Burkhart, Peterson & Company ] in this
<td class="generalinfo_left" align="right">Name:</td>
<td class="generalinfo_right">Burkhart, Peterson & Company</td>

Generally parsing (X)HTML using Regular Expressions is bad practice. Ruby has the fantastic Nokogiri Library which uses libxml2 for parsing XHTML efficiently.
With that being said, your . does not match newlines. Use the m modifier on your regexp, or the Regexp::MULTILINE constant, which tells . to match newlines. Both are documented in the Ruby Regexp class documentation.
Your regular expression is also capturing the HTML before the text you require.
Using nokogiri and XPath would mean you could grab the content of this table cell by referring to its CSS class. Like this:
#!/usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri::HTML DATA.read
p doc.at("td[@class='generalinfo_right']").text
__END__
<td class="generalinfo_left" align="right">Name:</td>
<td class="generalinfo_right">Burkhart, Peterson & Company</td>
Which will return "Burkhart, Peterson & Company"

/m makes the dot match newlines
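A minimal sketch of that difference, using a made-up two-line string rather than the HTML from the question (the variable names are only for illustration):

```ruby
text = "Name:\nBurkhart"

# Without a modifier, "." stops at the newline, so the pattern cannot match.
plain = text[/Name:(.*)Burkhart/, 1]

# With /m, "." also matches "\n", so the group captures the line break.
multi = text[/Name:(.*)Burkhart/m, 1]

# Regexp::MULTILINE is the same option, passed to Regexp.new as a constant.
via_constant = Regexp.new('Name:(.*)Burkhart', Regexp::MULTILINE)
same = text[via_constant, 1]

p plain
p multi
p same
```

Here plain is nil, while multi and same both hold the captured newline.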

You'll want to use /m for multiline mode:
str.scan(/Name:<\/td>(.*?)<\/td>/m)

html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/s)) doesn't match the new line characters; even if it would match those characters, the (.*?) part would grab everything after </td>, including <td class="generalinfo_right">.
To make the regular expression more generic, and to match exactly the text you want, you should also skip the whitespace and the opening tag of the next cell, and use /m rather than /s (in Ruby, /s is an encoding flag and does not change what . matches):
html.scan(/Name:<\/td>\s*<td[^>]*>(.*?)<\/td>/m)
The regular expression could still be written better, though.
I would also not recommend parsing HTML/XHTML content with regular expressions.

You can verify that all the answers suggesting you add /m or Regexp::MULTILINE are correct by going to rubular.com.
I also verified the solution in the console, and modified the regex so that it returns only the name instead of all the extra junk.
Loading development environment (Rails 2.3.8)
ree-1.8.7-2010.02 > html = '<td class="generalinfo_left" align="right">Name:</td>
ree-1.8.7-2010.02'> <td class="generalinfo_right">Burkhart, Peterson & Company</td>
ree-1.8.7-2010.02'> '
=> "<td class=\"generalinfo_left\" align=\"right\">Name:</td>\n<td class=\"generalinfo_right\">Burkhart, Peterson & Company</td>\n"
ree-1.8.7-2010.02 > html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/m))
=> [["\n<td class=\"generalinfo_right\">Burkhart, Peterson & Company"]]
ree-1.8.7-2010.02 > html.scan(Regexp.new(/Name:<\/td>.*<td[^>]*>(.*?)<\/td>/m))
=> [["Burkhart, Peterson & Company"]]
ree-1.8.7-2010.02 >
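One caveat with the greedy .* in the second pattern: if the document contains further cells after the name, .* runs on to the last one. A sketch of the difference, assuming a second, hypothetical "City" row that is not in the original question:

```ruby
# The City row and its value are made up for illustration.
html = <<~HTML
  <td class="generalinfo_left" align="right">Name:</td>
  <td class="generalinfo_right">Burkhart, Peterson & Company</td>
  <td class="generalinfo_left" align="right">City:</td>
  <td class="generalinfo_right">Springfield</td>
HTML

# Greedy .* with /m skips ahead to the LAST <td> in the string.
greedy = html.scan(/Name:<\/td>.*<td[^>]*>(.*?)<\/td>/m)

# \s* only allows whitespace between the label cell and the value cell.
tight = html.scan(/Name:<\/td>\s*<td[^>]*>(.*?)<\/td>/m)

p greedy
p tight
```

With this input the greedy pattern captures the City value, while the \s* version still captures the name.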

Related

Capybara rejects text with "<" (lower than character)

I have a spec to test if I'm able to show some different characters inside a given element. Say we have the element:
<p class="my-strange-characters-text">
"Here they are: \" & ; ' > # <"
</p>
The problem is that, with Capybara's default driver it doesn't retrieve the "<" character.
In my spec, if I do:
first(".my-strange-characters-text").text
The output is
Here they are: \" & ; ' > #
No "<" character! (nor whatever I insert after)
But if I use :js => true, which invokes the Poltergeist driver, it returns the text correctly.
I don't want to use :js => true on this specific text.
Obs:
I've tried '<', \< and other tricks to make it appear, but no success.
Any hint?
In HTML, a literal < is written as &lt;. Ampersands must also be replaced with &amp;
Browsers try their best to interpret invalid HTML, which is probably why poltergeist (which under the hood is using WebKit) is able to guess at the text you wanted to insert.
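If the text is generated from Ruby, one way to get those entities is CGI.escapeHTML from the standard library. This is only a sketch; the raw variable stands in for wherever the paragraph text comes from:

```ruby
require 'cgi'

# The characters from the question, as they should appear to the reader.
raw = %q(Here they are: \" & ; ' > # <)

# escapeHTML replaces &, <, >, " and ' with their entities.
escaped = CGI.escapeHTML(raw)
puts escaped
```

The escaped string is what belongs inside the <p> element; a parser then reads the entities back as the literal characters.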

Nokogiri ignores everything after the first attribute because of backslashes?

Why does Nokogiri ignore everything after the first attribute because of backslashes?
I'm not really sure why it's doing this:
[12] pry(Template)> b
=> "<td style=\\\"color:#fff; padding:3px; font-size:11px; text-align:center;\\\">Home Improvement Agreement: Electrical Services & Standby Generators</td>"
[13] pry(Template)> Nokogiri::HTML.parse(b).to_html
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><td style='\\\"color:#fff;' padding:3px font-size:11px text-align:center>Home Improvement Agreement: Electrical Services & Standby Generators</td></body></html>\n"
Notice how it produced bad HTML: everything after the color declaration in the <td> element is wrong. It closed the style attribute early and, I guess, treated the rest of the declarations as separate attribute names.
I'm curious if anyone knows why Nokogiri would do this, and what I can do to circumvent it?
You are asking it to parse this:
<td style=\"color:#fff; ...\">
which is not valid. This would be valid:
<td style="color:#fff; ...">
Try:
'<td style="color:#fff; padding:3px; font-size:11px; text-align:center;">Home Improvement Agreement: Electrical Services & Standby Generators</td>'
Nokogiri makes it easy to tell whether there is a problem parsing the HTML or XML document:
require 'nokogiri'
html = '<td style=\"color:#fff; padding:3px; font-size:11px; text-align:center;\">Home Improvement Agreement: Electrical Services & Standby Generators</td>'
doc = Nokogiri::HTML.parse(html)
doc.errors
=> [#<Nokogiri::XML::SyntaxError: error parsing attribute name>, #<Nokogiri::XML::SyntaxError: error parsing attribute name>, #<Nokogiri::XML::SyntaxError: error parsing attribute name>, #<Nokogiri::XML::SyntaxError: htmlParseEntityRef: no name>]
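If the backslashes come from an earlier escaping step and you cannot fix them at the source, a possible workaround is to strip them before parsing. This sketch uses a shortened version of the string and assumes every backslash-quote pair is an artifact rather than intended content:

```ruby
# Single quotes keep the backslashes literal, matching the pry value of b.
raw = '<td style=\"color:#fff; padding:3px;\">Home Improvement Agreement</td>'

# Replace each backslash-quote pair with a plain double quote.
cleaned = raw.gsub('\\"', '"')
puts cleaned
```

The cleaned string can then be passed to Nokogiri::HTML.parse, and doc.errors can be checked again.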

How to beautify xml code in rails application

Is there a simple way to print an unformated xml string to screen in a ruby on rails application? Something like a xml beautifier?
Ruby's standard library includes REXML, and REXML::Document has pretty printing:
REXML::Document#write( output=$stdout, indent=-1, transitive=false, ie_hack=false )
indent: An integer. If -1, no indenting will be used; otherwise, the indentation will be twice this number of spaces, and children will be indented an additional amount. For a value of 3, every item will be indented 3 more levels, or 6 more spaces (2 * 3). Defaults to -1
An example:
require "rexml/document"
doc = REXML::Document.new "<a><b><c>TExt</c><d /></b><b><d/></b></a>"
out = ""
doc.write(out, 1)
puts out
Produces:
<a>
 <b>
  <c>
   TExt
  </c>
  <d/>
 </b>
 <b>
  <d/>
 </b>
</a>
EDIT: Rails already has REXML loaded, so you only have to create a new document and then write the pretty-printed XML to a string, which can then be embedded in a <pre> tag.
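If the one-text-node-per-line output is too spread out, REXML's Pretty formatter can also be used directly. This is a sketch of that variant, not part of the answer above; the two-space indent and the pretty_xml name are arbitrary choices:

```ruby
require 'rexml/document'
require 'rexml/formatters/pretty'

doc = REXML::Document.new("<a><b><c>TExt</c><d /></b><b><d/></b></a>")

formatter = REXML::Formatters::Pretty.new(2)  # 2 spaces per nesting level
formatter.compact = true                      # keep text-only elements on one line

pretty_xml = +""  # unfrozen string used as the output buffer
formatter.write(doc, pretty_xml)
puts pretty_xml
```

With compact set, an element that only contains text, such as <c>, stays on a single line.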
What about the Nokogiri gem? Here is an example use.

Can I use a regular expression to extract the domain from a URL?

Suppose I want to turn this :
http://en.wikipedia.org/wiki/Anarchy
into this :
en.wikipedia.org
or even better, this :
wikipedia.org
Is this even possible in regex?
Why use a regex when Ruby has a library for it? The URI library:
ruby-1.9.1-p378 > require 'uri'
=> true
ruby-1.9.1-p378 > uri = URI.parse("http://en.wikipedia.org/wiki/Anarchy")
=> #<URI::HTTP:0x000001010a2270 URL:http://en.wikipedia.org/wiki/Anarchy>
ruby-1.9.1-p378 > uri.host
=> "en.wikipedia.org"
ruby-1.9.1-p378 > uri.host.split('.')
=> ["en", "wikipedia", "org"]
Splitting the host is one way to separate the domains, but I'm not aware of a reliable way to get the base domain -- you can't just count, in the event of a URL like "http://somedomain.otherdomain.school.ac.uk" vs "www.google.com".
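One way to handle hosts like the .ac.uk example is to check the last two labels against a list of known two-label suffixes. The sketch below uses a tiny hand-picked list (TWO_LABEL_SUFFIXES and base_domain are hypothetical names); a real application would need the full Public Suffix List rather than these four entries:

```ruby
require 'uri'

# Hypothetical, incomplete list of suffixes that span two labels.
TWO_LABEL_SUFFIXES = %w[co.uk ac.uk org.uk com.au].freeze

# Returns the registrable part of the URL's host, or nil if there is no host.
def base_domain(url)
  host = URI.parse(url).host
  return nil if host.nil?

  labels = host.split('.')
  # Keep three labels when the last two form a known suffix, otherwise two.
  keep = TWO_LABEL_SUFFIXES.include?(labels.last(2).join('.')) ? 3 : 2
  labels.last(keep).join('.')
end

puts base_domain("http://en.wikipedia.org/wiki/Anarchy")
puts base_domain("http://somedomain.otherdomain.school.ac.uk")
```

Note that URI.parse needs the scheme: a bare "www.google.com" has no host component, so base_domain returns nil for it.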
/http:\/\/([^\/]*).*/ will produce en.wikipedia.org from the string you provided.
/http:\/\/.{0,3}\.([^\/]*).*/ will produce wikipedia.org.
yes
Now I know you haven't asked for how, and you haven't specified a language, but I'll answer anyway... (note, this works for all language subsites, not just en.wikipedia...)
perl:
$url =~ s,http://[a-z]{2}\.(wikipedia\.org)/.*,$1,;
ruby:
url = url.sub(/http:\/\/[a-z]{2}\.(wikipedia\.org)\/.*/, '\1')
php:
$url = preg_replace('|http://[a-z]{2}\.(wikipedia\.org)/.*|', '$1', $url);
Of course, for this particular example, you don't even need a regex, just this will do:
url = 'wikipedia.org'
but I jest...
you probably want to handle any URL and pull out the domain part, and it should also work for domains in different countries, eg: foo.co.uk.
In which case, I'd use Mark Rushakoff's solution to get the hostname and then a regex to pull out the domain:
domain = host.sub(/^.*\.([^.]+\.[^.]+(\.[a-z]{2})?)$/, '\1')
Hope this helps
Also, if you want to learn more, I have a regex tute online: http://tech.bluesmoon.info/2006/04/beginning-regular-expressions.html
Sure all you would have to do is search on http://(.*)/wiki/Anarchy
In Perl (Sorry I don't know Ruby, but I expect it's similar)
$string_to_search =~ s/http:////(.)//. should give you wikipedia.org
to get rid of the en, you can simply search on http:////en(.)//......
That should do it.
Update: In case you're not familiar with regex, I would recommend picking up a regex book. Mastering Regular Expressions really rocks; I saw it on half.com the other day for 14.99 used. To clarify what I suggested above: look for the string http://en, then for anything until you find a /. This is all captured in $1 (in Perl; not sure if it's the same in Ruby), so a simple print $1 will print the string.
Update: #2 sorry the star in the regex is not showing up for some reason, so where you see the . in the () and after the // just imagine a *, oh and I forgot for the en part add a /. at the end that way you don't end up with .wikipedia.org

Convert from wiki to html

I'm using a Wikipedia API to get info from Wikipedia.
Is there anything for converting wiki text to HTML?
I've tried mediacloth but it doesn't work well.
Take a look at marker.
>> require 'marker'
>> m = Marker.parse "== heading ==\nparagraph with '''bold''' text"
>> puts m.to_html
<h2>heading</h2>
<p>paragraph with <b>bold</b> text</p>
Try also wikicloth http://code.google.com/p/wikicloth/ it implements some things that others haven't like tables.
You could download a static HTML dump of Wikipedia.

Resources