SAX parsing a bunch of dead presidents with Nokogiri HTML parser? - ruby-on-rails

I would like to parse USA presidents on the "List of Presidents of the United States" wiki page.
I can do this with a bunch of XPath and loops. But SAX parsing is so fast and I would like to learn how to implement that.
The Nokogiri documentation gave me an HTML SAX parsing example:
class MyDoc < Nokogiri::XML::SAX::Document
  def start_element name, attributes = []
    puts "found a #{name}"
  end
end
parser = Nokogiri::HTML::SAX::Parser.new(MyDoc.new)
# File.read takes the open mode as a keyword argument, not a second positional one
parser.parse(File.read(ARGV[0], mode: 'rb'))
But which methods do I use to define all the HTML elements and their content that I want to grab?

With SAX, you have to define callback methods in your parser for each 'event'. You have to keep track of state yourself. It is very crude. For example, to get president names from the page, you can do this:
class MyDoc < Nokogiri::XML::SAX::Document
  def start_element name, attributes = []
    if name == "li"
      @inside_li = true
    end
  end
  def characters(chars)
    if @inside_li
      puts "found an <li> containing the string '#{chars}'"
    end
  end
  def end_element name
    if name == "li"
      puts "ending #{name}"
      @inside_li = false
    end
  end
end
The above can be thought of as the rough equivalent of the statement:
doc.xpath('//li').map(&:text)
Which starts with the following output:
ending li
found an <li> containing the string 'Grover Cleveland'
ending li
found an <li> containing the string 'William McKinley'
ending li
found an <li> containing the string 'Theodore Roosevelt'
So far so good. However, it also outputs a lot of cruft, ending with:
found an <li> containing the string 'Disclaimers'
ending li
found an <li> containing the string 'Mobile view'
ending li
found an <li> containing the string '
'
found an <li> containing the string '
'
ending li
found an <li> containing the string '
'
found an <li> containing the string '
'
ending li
So to make this more precise and not get the li elements you don't care about, you'd have to keep track of which container elements you are in by adding more if clauses to start_element, characters, etc. And if you have nested elements of the same name, you'll have to keep track of counters yourself, or implement a stack to push and pop the elements you see. It gets VERY messy very fast.
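As a rough illustration of the stack idea, here is a minimal sketch. The class name ListItemCollector and the event list are made up for this example, and the handler is driven by hand so that it runs without the gem; with Nokogiri it would subclass Nokogiri::XML::SAX::Document and the parser would fire the callbacks.

```ruby
# Hypothetical handler: collects the text of <li> elements that sit inside an <ol>.
# It is a plain class here so the sketch runs without Nokogiri installed.
class ListItemCollector
  attr_reader :items

  def initialize
    @stack = []  # names of the elements that are currently open
    @items = []  # one string per <li> found below an <ol>
  end

  def start_element(name, attributes = [])
    @stack.push(name)
    @items << String.new if name == "li" && inside_ordered_item?
  end

  def characters(chars)
    @items.last << chars if inside_ordered_item?
  end

  def end_element(name)
    @stack.pop
  end

  private

  # True when an <li> is open somewhere below an open <ol>.
  def inside_ordered_item?
    li = @stack.rindex("li")
    ol = @stack.rindex("ol")
    !li.nil? && !ol.nil? && ol < li
  end
end

# Simulated parser events (assumed input, not taken from the real wiki page).
events = [
  [:start_element, "ul"], [:start_element, "li"], [:characters, "Mobile view"],
  [:end_element, "li"], [:end_element, "ul"],
  [:start_element, "ol"], [:start_element, "li"], [:characters, "George "],
  [:start_element, "a"], [:characters, "Washington"], [:end_element, "a"],
  [:end_element, "li"], [:characters, "\n"],
  [:start_element, "li"], [:characters, "John Adams"], [:end_element, "li"],
  [:end_element, "ol"]
]

handler = ListItemCollector.new
events.each { |callback, arg| handler.public_send(callback, arg) }
p handler.items
```

Even this small filter needs a stack and a helper method, and it still ignores cases such as an <li> nested inside another <li>.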
SAX is best for filters where you don't care about the DOM, you're just doing some basic transformations.
Instead, consider using a single XPath statement, such as
doc.xpath("//table[contains(.//div, 'Presidents of the United States')]//ol/li").map(&:text)
This says "Find the table which contains a div with the words 'Presidents of the United States' and return the text from all the ordered list items within it". This can be done in SAX, but it would be a lot of messy code.
Output of the above XPath:
["George Washington", "John Adams", "Thomas Jefferson", "James Madison", "James Monroe", "John Quincy Adams", "Andrew Jackson", "Martin Van Buren", "William Henry Harrison", "John Tyler", "James K. Polk", "Zachary Taylor", "Millard Fillmore", "Franklin Pierce", "James Buchanan", "Abraham Lincoln", "Andrew Johnson", "Ulysses S. Grant", "Rutherford B. Hayes", "James A. Garfield", "Chester A. Arthur", "Grover Cleveland", "Benjamin Harrison", "Grover Cleveland", "William McKinley", "Theodore Roosevelt", "William Howard Taft", "Woodrow Wilson", "Warren G. Harding", "Calvin Coolidge", "Herbert Hoover", "Franklin D. Roosevelt", "Harry S. Truman", "Dwight D. Eisenhower", "John F. Kennedy", "Lyndon B. Johnson", "Richard Nixon", "Gerald Ford", "Jimmy Carter", "Ronald Reagan", "George H. W. Bush", "Bill Clinton", "George W. Bush", "Barack Obama"]

Related

How to replace Space with Line Break in Ruby on Rails?

I am trying to replace each space in a string with a line break in Ruby on Rails:
name = 'john smith'
I have tried the following so far:
name.gsub!(" ", "\n")
name.gsub!(" ", "<br>")
name.sub(" ", "\n")
name.sub(" ", "<br>")
but none of the above worked.
You have to be careful when marking a string as html_safe, especially if it may contain user input:
name = 'john smith<script>alert("gotcha")</script>'
name.gsub(' ', '<br>').html_safe
#=> "john<br>smith<script>alert(\"gotcha\")</script>"
Rails would output that string as-is, i.e. including the <script> tag.
In order to take advantage of Rails' HTML escaping, you should only mark the trusted parts as being html_safe. For a manually concatenated string:
''.html_safe + 'john' + '<br>'.html_safe + 'smith<script>alert("gotcha")</script>'
#=> "john<br>smith&lt;script&gt;alert(&quot;gotcha&quot;)&lt;/script&gt;"
As you can see, only the <br> tag was left intact, the remaining parts were properly escaped.
There are several helpers for building safe strings as well as for building HTML tags. In your case, I'd use safe_join and tag:
name = 'john smith<script>alert("gotcha")</script>'
safe_join(name.split(' '), tag(:br))
#=> "john<br />smith&lt;script&gt;alert(&quot;gotcha&quot;)&lt;/script&gt;"
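The same idea (escape the untrusted pieces, keep only the separator as markup) can be sketched outside Rails with the standard library's ERB::Util.html_escape. The method name spaces_to_br is made up for this example.

```ruby
require "erb"

# Escape every untrusted piece first, then join the pieces with the
# trusted <br> separator.
def spaces_to_br(text)
  text.split(" ").map { |part| ERB::Util.html_escape(part) }.join("<br>")
end

result = spaces_to_br('john smith<script>alert("gotcha")</script>')
puts result
```

Inside a Rails view the returned plain String would be escaped again unless it is marked html_safe, which is why safe_join is the more convenient route there.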
When printing it in HTML you need to use raw; otherwise Rails will escape the tags:
= raw name.gsub(" ", "<br>")
Try another one:
<%= name.gsub(" ", "<br>").html_safe %>
html_safe :
Marks a string as trusted safe. It will be inserted into HTML with no additional escaping performed.
"<a>Hello</a>".html_safe
#=> "<a>Hello</a>"
nil.html_safe
#=> NoMethodError: undefined method `html_safe' for nil:NilClass
raw :
raw is just a wrapper around html_safe. Use raw if the string might be nil.
raw("<a>Hello</a>")
#=> "<a>Hello</a>"
raw(nil)
#=> ""

How to remove all the <br> from the first paragraph and last using regex

I am trying to get rid of all the extra <br> in the first paragraph and last paragraph.
For example:
st = "<p><br><br><br><br>apple</p>
<p>bananas</p>
<p>orange<br><br><br><br><br></p>
<p>tomatoes</p>
<p>berry<br><br><br><br><br><br></p>"
I'm hoping to end up with this:
"<p>apple</p>
<p>bananas</p>
<p>orange<br><br><br><br><br></p>
<p>tomatoes</p>
<p>berry</p>"
My goal is to leave the <br> tags in the middle paragraphs (e.g. the orange paragraph) alone, and to remove all the <br> tags from the first paragraph and from the end of the last paragraph.
I've tried doing this regex:
st.sub(/^((<p>)|<br( \/)?>)*|(<p>|<br( \/)?>|< \/p>)*$/, '')
I get this:
=> "<p>apple</p>
<p>bananas</p>
<p>orange<br><br><br><br><br></p>
<p>tomatoes</p>
<p>berry<br><br><br><br><br><br></p>"
I am unable to delete the repeated <br> tags in the last paragraph.
Don't use regular expressions. Instead use a parser:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p><br><br><br><br>apple</p>
<p>bananas</p>
<p>orange<br><br><br><br><br></p>
<p>tomatoes</p>
<p>berry<br><br><br><br><br><br></p>
EOT
p_tags = doc.search('p')
[:first, :last].each { |s| p_tags.send(s).search('br').remove }
doc.to_html
Which would result in the fragment looking like:
# => "<p>apple</p>\n" +
# "<p>bananas</p>\n" +
# "<p>orange<br><br><br><br><br></p>\n" +
# "<p>tomatoes</p>\n" +
# "<p>berry</p>\n"
Parsers cope much better with changing HTML, so if you're going to do any HTML changes or scraping it pays off to learn how to use them.
An alternate way to do what you want without a parser or a complicated regex is:
str = <<EOT
<p><br><br><br><br>apple</p>
<p>bananas</p>
<p>orange<br><br><br><br><br></p>
<p>tomatoes</p>
<p>berry<br><br><br><br><br><br></p>
EOT
str_lines = str.lines
[0, -1].each { |i| str_lines[i].gsub!(/<br>/, '') }
puts str_lines.join
Which results in the same thing.
The strength of the first method is that it won't care if the <br> mysteriously change to <br/> as in HTML5, or <br >.
Finally, if you doubly insist on using a longer, more complicated, pattern, at least simplify it:
puts str.sub(/\A<p>(?:<br>)+/, '<p>').sub(/(?:<br>)+<\/p>\Z/, '</p>')
which results in the same thing again.
Regular expressions are great for some tasks, but they're not good for markup. If you insist on using a regular expression, then simplify the problem as in the later solutions because it reduces the complexity of the pattern, which improves readability and eases maintenance.
st = st.gsub(/(?<=\A<p>)(<br\/?>)+|(<br\/?>)+(?=[<]\/p>\Z)/, '')
There are two parts separated by a pipe (OR):
1) (?<=\A<p>)(<br\/?>)+ matches one or more <br> that follow the start of the string (\A) and a <p> tag
2) (<br\/?>)+(?=[<]\/p>\Z) matches one or more <br> that precede a </p> closing tag at the end of the string (\Z)
gsub is used because we want to replace all occurrences in the string, not just the first.
The g in gsub stands for global.
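To check the pattern against input that mixes <br> and <br/>, a short sketch follows. The sample string is shortened and made up for this example.

```ruby
# Sample input. The trailing newline after the last </p> is fine, because
# \Z also matches just before a final newline.
st = "<p><br><br/>apple</p>\n" \
     "<p>orange<br><br></p>\n" \
     "<p>berry<br><br/></p>\n"

pattern = /(?<=\A<p>)(<br\/?>)+|(<br\/?>)+(?=[<]\/p>\Z)/

cleaned = st.gsub(pattern, '')
puts cleaned
```

Only the breaks at the very start of the first paragraph and at the very end of the last one are removed; the ones in the orange paragraph stay.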
I suggest something simple that's easy to understand, test and maintain.
str =<<-_
<p><br><br><br><br>apple</p>
<p>bananas</p>
<p>orange<br><br><br><br><br></p>
<p>tomatoes</p>
<p>berry<br><br><br><br><br><br></p>
_
#=> "<p><br><br><br><br>apple</p>\n<p>bananas</p>\n<p>orange<br><br><br><br><br></p>\n<p>tomatoes</p>\n<p>berry<br><br><br><br><br><br></p>\n"
first, *mid, last = str.lines
s = first.gsub('<br>', '') << mid.join << last.gsub('<br>', '')
#=> "<p>apple</p>\n<p>bananas</p>\n<p>orange<br><br><br><br><br></p>\n<p>tomatoes</p>\n<p>berry</p>\n"
puts s
<p>apple</p>
<p>bananas</p>
<p>orange<br><br><br><br><br></p>
<p>tomatoes</p>
<p>berry</p>
Note that
first
#=> "<p><br><br><br><br>apple</p>\n"
mid
#=> ["<p>bananas</p>\n",
# "<p>orange<br><br><br><br><br></p>\n",
# "<p>tomatoes</p>\n"]
last
#=> "<p>berry<br><br><br><br><br><br></p>\n"

Rails 4 - manipulate form input and display in view

I have a page where I take input from a form and then display it in a view.
View:
<%= form_tag("/page", method: "get") do %>
  <%= text_area_tag(:x, @input) %>
  <%= submit_tag("Submit Form") %>
<% end %>
<%= @input %>
Controller:
def myMethod
  if params[:x].present?
    @input = "#{params[:x]}"
  end
end
This works fine; however, I want to find the spaces in the string, replace each one with a new line, and add a ",". For example, if the user inputs ‘cat dog mouse’, I want the view to return:
'cat',
'dog',
'mouse',
Is there an easy way to do this with a ruby function or will I need to write a regular expressions text search?
Thanks
A simple gsub will do:
"cat dog mouse".gsub(" ", ",\n")
This will replace every occurrence of a space with a comma followed by a newline.
Update
Since you want to encapsulate each line with single quotes, a simple way to do it would be:
"cat dog mouse".split # Split the string into an array (automatically splits by space)
  .map { |w| "'#{w}'" } # Wrap each word in single quotes
  .join(",\n") # Convert the array into a string again and insert the comma/newline characters between each entry
That code, of course, can all be written on one line.
Here's another quick way to do this:
string = "cat dog mouse"
new_string = "'" + string.split.join("',\n'") + "'" # Outputs the same as above. Less friendly to read, but is also shorter.
You can use split, map and join:
"cat dog mouse".split(" ").map {|a| "'#{a}',\n"}.join
split creates a list ["cat", "dog", "mouse"]
map transforms it ["'cat',\n", "'dog',\n", "'mouse',\n"]
join creates a string again "'cat',\n'dog',\n'mouse',\n"
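The split/map/join steps above can be wrapped in a small method. This is a sketch; the name quote_words is made up for this example. Calling split with no argument also copes with repeated or surrounding whitespace.

```ruby
# Wrap each whitespace-separated word in single quotes, add a trailing
# comma, and put every word on its own line.
def quote_words(text)
  text.split.map { |word| "'#{word}'," }.join("\n")
end

formatted = quote_words("cat dog mouse")
puts formatted

# Extra whitespace does not produce empty entries.
padded = quote_words("  cat   dog ")
```

In a Rails view the newlines would still have to be turned into markup (for example with simple_format or a <pre> block) to show up as line breaks.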

How do I take a substring of a string with multiple quotes? Rails/Ruby

Given a string such as (Shift opened: \"he clams \"sick\" but not sure\") how would I extract just the part between the first set of quotes? I've tried combinations of split, slice and squeeze but always run into a case where it doesn't work. Thanks.
EDIT: The user inputs text, which can be in any form, so yes, someone could have an odd number of quotes. The text before the input is generated for record purposes. Some examples:
n = (Shift opened: \"he clams \"sick\" but not sure\")
n.split('"')[1] > "he claims "
If I could find the size of the array created by split I could do split('"')[1..size-1] but I'm not sure how to find that.
n = (Shift opened: \"\"sick\"\")
n.squeeze('"').split('"')[1] >> "sick"
That works fine.
This is more for error checking and making sure if people use quotes on input, it doesn't mess things up. And no I cannot edit how the string is generated. Hope I'm clear enough!
You can leverage the fact that regex quantifiers are greedy by default and use /"(.*)"/, which will capture all text between the first and last quotes:
n = 'Shift opened: "he clams "sick" but not sure" some more text'
n[/"(.*)"/, 1]
# => "he clams \"sick\" but not sure"
n = "Shift opened: \"\"sick\"\""
n[/"(.*)"/, 1]
# => "\"sick\""
I'm not sure whether you want to extract the quoted text recursively and get something like this:
=> "he clams "sick" but not sure"
=> "sick"
or "lorem ipsum "xxxxx yyyy "alpha beta" zzzz wwww" dol"
=> "lorem ipsum "xxxxx yyyy "alpha beta" zzzz wwww" dol"
=> "xxxxx yyyy "alpha beta" zzzz wwww"
=> "alpha beta"
perhaps you will need a simple CFG:
S -> aS | a
a = /\".*\"/
or iterate over the string, stacking substrings at each quote
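One way to get the nested results listed above is to apply a greedy match repeatedly, peeling one layer of quotes per step. This is a sketch rather than the grammar-based approach; the method name quoted_layers is made up for this example, and it assumes the quotes nest from the outside in.

```ruby
# Return the text between the outermost quotes, then the text between the
# outermost quotes of that result, and so on until no quoted part is left.
def quoted_layers(text)
  layers = []
  current = text
  # Each capture is shorter than the string it came from, so the loop ends.
  while (inner = current[/"(.*)"/m, 1])
    layers << inner
    current = inner
  end
  layers
end

layers = quoted_layers('Shift opened: "he clams "sick" but not sure"')
layers.each { |layer| puts layer }
```

With an odd number of quotes the match still runs from the first quote to the last one, so the unmatched quote ends up either outside the capture or inside the innermost layer.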

Split string by whitespace in Rails unexpected behavior

I want to split a string by whitespace
irb(main):001:0> input = "dog cat"
=> "dog cat"
irb(main):002:0> output = input.strip.split(/\s+/)
=> ["dog", "cat"]
This is good. However, I'm also doing this in the controller in Rails, and when I supply the same input, and have it print out the output #{output} into my view, it shows as dogcat instead of ["dog", "cat"]. I am really confused how that can happen. Any ideas?
I'm printing it using @notice = "#{output}" in the controller, and in my view I have <%= @notice %>
Rather than splitting your string in the controller and sending it as an array to your view, send the entire string to your view:
input = "dog cat"
@notice = input
Then, in your view, split the string and display it as a stringified array:
<%= @notice.strip.split(/\s+/).inspect %>
If you print an array of strings on Ruby 1.8, you'll get the strings all concatenated together, because Array#to_s there behaves like join (Ruby 1.9 and later return the inspect form instead). You'd get the same thing in irb if you had entered print "#{output}". You need to decide how you want to format them and print them that way, perhaps with a simple helper function. For example, the helper could do:
output.each { |s| puts "<p>#{s}</p>" }
Or whatever you like.
Continuing with your example code:
>> input = "dog cat"
=> "dog cat"
>> output = input.strip.split /\s+/
=> ["dog", "cat"]
>> joined = output.join ' '
=> "dog cat"
Remember too that Ruby has several helpers like %w and %W for letting you convert a string into an array of words. If you're starting with an array of words, each of which may have whitespace before and after its individual item, you might try something like this:
>> # `input` is an array of words that was populated Somewhere Else
>> # `input` has the initial value [" dog ", "cat\n", "\r tribble\t"]
>> output = input.join.strip.split(/\s+/)
=> ["dog", "cat", "tribble"]
>> joined = output.join ' '
=> "dog cat tribble"
Calling Array#join without any parameter will join the array items together with no separation between them, which is what seems to happen in your example where you just render the array as a string:
>> @notice = output
>> # @notice will render as 'dogcat'
As opposed to:
>> @notice = output.join ' '
>> # @notice will render as 'dog cat'
And there you go.
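For reference, the different ways of turning the array back into a string can be compared side by side. This is a plain Ruby sketch with made-up variable names, run outside Rails.

```ruby
output = " dog cat ".strip.split(/\s+/)

concatenated = output.join       # no separator: "dogcat"
spaced       = output.join(" ")  # explicit separator: "dog cat"
inspected    = output.inspect    # array literal form: "[\"dog\", \"cat\"]"

# Splitting on a regex keeps a leading empty string when the input starts
# with whitespace; split without arguments does not.
with_regex    = " dog cat".split(/\s+/)  # ["", "dog", "cat"]
without_regex = " dog cat".split         # ["dog", "cat"]

puts concatenated, spaced, inspected
p with_regex, without_regex
```

Choosing join(" ") or inspect explicitly avoids depending on how a given Ruby version converts an array to a string.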
