counting line numbers of a poem with nokogiri / ruby - nokogiri

I've struggling to try to do this with a simple regex but it's never been very accurate. It doesn't have to be perfect.
Source has a combination of and tags. I don't want to count blank lines.
Old way:
self.words = rendered.gsub(/<p> <\/p>/,'').gsub(/<p><br\s?\/?>|(?:<br\s?\/?>){2,}/,'<br>').scan(/<br>|<br \/>|<p/).size+1
New way (not working:
Tries to turn all the + into paragraphs, then throw it into nokogiri to count paragraph tags with more than 3 chars in them (I have no idea how? Counting 1 letter lines would be nice too, but this worked ok in javascript)
h = rendered
h.gsub!(/<br>\s*<br>/gi,"<p>")
h.gsub!(/<br>/gi,"<p>") if h =~ /<br>\s*<br>/
h.prepend "<p>" if !h =~ /^\s*<p[^>]*>/i
h.replace(/<p>\s*<p>/g,"<p> </p><p>")
Nokogiri::HTML(rendered)
# find+count p tags with at least 1-3 chars?
# this is javascript not ruby, but you get the idea
$('p', c).each(function(i) { // had to trim it to remove whitespaces from start/end.
if ($(this).children('img').length) return; // skip if it's just an image.
if ($.trim($(this).text()).length > 3)
$(this).append("<div class='num'>"+ (n += 1) +"</div>");
})
Other methods are welcome!
Example poem ( http://allpoetry.com/poem/7429983-the_many_endings-by-Kevin )
<p>
from the other side of silence<br>
you met me with change and a pocket<br>
of unhappy apples.</p>
<p>
</p>
<p>
<br>
we bled together to black<br>
and chose the path carefully to<br>
france.<br><br>
sometimes when you smile<br>
your radiant footsteps fall<br>
and all around us is silence:<br>
each dream step is<br>
false but full of such glory</p>
<p>
</p>
<p>
<br>
unhappiness never made a student of you:<br>
just two by two by two. now three<br>
this great we that overflows our<br>
heart-cave<br><br>
each jewel-like addition to the delicate<br>
crown. but flowers fall and dreams,<br>
all dreams, come to and end with death.</p>
Thank you!

For posterity, here's what I'm using now and it seems to be quite accurate. Non latin chars cause some problems sometimes from ckeditor, so I'm stripping them out for now.
html = Nokogiri::HTML(rendered)
text = html.at('body').inner_text rescue nil
return self.words = rendered.gsub(/<p> <\/p>/,'').gsub(/<p><br\s?\/?>|(?:<br\s?\/?>){2,}/,'<br>').scan(/<br>|<br \/>|<p/).size+1 if !text
#bonus points to strip lines entirely non-letter. idk
#d "text is", text.gsub!(/([\x09|\x0D|\t])|(\xc2\xa0){1,}|[^A-z]/u,'')
text.gsub!(/[^A-z\n]/u,'')
#d "text is", text
self.words = text.strip.scan(/(\s*\n\s*)+/).size+1

Related

Show only first couple of paragraphs. Ruby on Rails

I am using simple_format(#company.description) to make a text into a number of paragraphs. On one page I only want to show, lets say, the first three paragraphs. Is there any nifty ruby method that can help out here or should I read up on my regular expression skills, which are very out of date.
Cheers Carl
The truncate helper method takes an optional :separator. Since simple_format uses <br /> tags to indicate paragraphs, you should be able to do something like this:
truncate(simple_format(#company.description), length: 3, separator: '<br />')
Personally, I'd round the text before putting it into a formatter. I'd assume you have a \r\n or \n in your text. Why format 1000 paragraphs if all you need is the first 3.
def get_first_paragraphs(text, desired = 3)
text.each_with_index do |character, index|
if(character == "\n")
desiredParCount -= 1
return text[0..index] if(desiredParCount <= 0)
end
end
return text;
end
It's a function sure, but it's the fastest method.. if speed is important. (it always is, ALWAYS!) :)

extracting runs of text with Mechanize/Nokogiri

Is there a sensible way to extract each run of text in a Mechanize-parsed HTML document, so that (for example):
<p>Here is <b>some</b> text<p>
is broken into three elements:
Here is
some
text
? My hunch is that there's a simple technique using recursive CSS search and/or #flatten, but I've not figured it out yet.
Borrowing from an answer in "Nokogiri recursively get all children":
result = []
doc.traverse { |node| result << node.text if node.text? }
That should give you the array ["Here is ", "some", " text"].
"Getting Mugged by Nokogiri" discusses traverse.
Since you want the contents of each text node, you can do this:
doc.search('//text()').map(&:text)
The only downside to this (and to the other answer) is that you get all the whitespace between elements as well. If you want to suppress this, you can do this:
doc.search('//text()').map(&:text).delete_if{|x| x !~ /\w/}
This removes all elements that don't contain a word character.

Regular expression to remove only beginning and end html tags from string?

I would like to remove for example <div><p> and </p></div> from the string below. The regex should be able to remove an arbitrary number of tags from the beginning and end of the string.
<div><p>text to <span class="test">test</span> the selection on.
Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>
I have been tinkering with rubular.com without success. Thanks!
def remove_html_end_tags(html_str)
html_str.match(/\<(.+)\>(?!\W*\<)(.+)\<\/\1\>/m)[2]
end
I'm not seeing the problem of \<(.+)> consuming multiple opening tags that Alan Moore pointed out below, which is odd because I agree it's incorrect. It should be changed to \<([^>\<]+)> or something similar to disambiguate.
def remove_html_end_tags(html_str)
html_str.match(/\<([^\>\<]+)\>(?!\W*?\<)(.+)\<\/\1\>/m)[2]
end
The idea is that you want to capture everything between the open/close of the first tag encountered that is not followed immediately by another tag, even with spaces between.
Since I wasn't sure how (with positive lookahead) to say give me the first key whose closing angle bracket is followed by at least one word character before the next opening angle bracket, I said
\>(?!\W*\<)
find the closing angle bracket that does not have all non-word characters before the next open angle bracket.
Once you've identified the key with that attribute, find its closing mate and return the stuff between.
Here's another approach. Find tags scanning forward and remove the first n. Would blow up with nested tags of the same type, but I wouldn't take this approach for any real work.
def remove_first_n_html_tags(html_str, skip_count=0)
matches = []
tags = html_str.scan(/\<([\w\s\_\-\d\"\'\=]+)\>/).flatten
tags.each do |tag|
close_tag = "\/%s" % tag.split(/\s+/).first
match_str = "<#{tag}>(.+)<#{close_tag}>"
match = html_str.match(/#{match_str}/m)
matches << match if match
end
matches[skip_count]
end
Still involves some programming:
str = '<div><p>text to <span class="test">test</span> the selection on.
Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>'
while (m = /\A<.+?>/.match(str)) && str.end_with?('</' + m[0][1..-1])
str = str[m[0].size..-(m[0].size + 2)]
end
Cthulhu you out there?
I am going to go ahead and answer my own question. Below is the programmatic route:
The input string goes into the first loop as an array in order to remove the front tags. The resulting string is looped through in reverse order in order to remove the end tags. The string is then reversed in order to put it in the correct order.
def remove_html_end_tags(html_str)
str_no_start_tag = ''
str_no_start_and_end_tag = ''
a = html_str.split("")
i= 0
is_text = false
while i <= (a.length - 1)
if (a[i] == '<') && !is_text
while (a[i] != '>')
i+= 1
end
i+=1
else
is_text = true
str_no_start_tag << a[i]
i+=1
end
end
a = str_no_start_tag.split("")
i= a.length - 1
is_text = false
while i >= 0
if (a[i] == '>') && !is_text
while (a[i] != '<')
i-= 1
end
i-=1
else
is_text = true
str_no_start_and_end_tag << a[i]
i-=1
end
end
str_no_start_and_end_tag.reverse!
end
(?:\<div.*?\>\<p.*?\>)|(?:\<\/p\>\<\/div\>) is the expression you need. But this doesn't check for every scenario... if you are trying to parse any possible combination of tags, you may want to look at other ways to parse.
Like for example, this expression doesn't allow for any whitespace between the div and p tag. So if you wanted to allow for that, you would add \s* inbetween the \>\< sections of the tag like so: (?:\<div.*?\>\s*\<p.*?\>)|(?:\<\/p\>\s*\<\/div\>).
The div tag and the p tag are expected to be lowercase, as the expression is written. So you may want to figure out a way to check for upper or lower case letters for each, so that Div or dIV would be found too.
Use gskinner's RegEx tool for testing and learning Regular Expressions.
So your end ruby code should look something like this:
# Ruby sample for showing the use of regular expressions
str = "<div><p>text to <span class=\"test\">test</span> the selection on.
Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>"
puts 'Before Reguar Expression: "', str, '"'
str.gsub!(/(?:\<div.*?\>\s*\<p.*?\>)|(?:\<\/p\>\s*\<\/div\>)/, "")
puts 'After Regular Expression', str
system("pause")
EDIT: Replaced div*? to div.*? and replaced p*? to p.*? per suggestions in the comments.
EDIT: This answer doesn't allow for any set of tags, just the two listed in the first line of the question.

Interpret newlines as <br>s in markdown (Github Markdown-style) in Ruby

I'm using markdown for comments on my site and I want users to be able to create line breaks by pressing enter instead of space space enter (see this meta question for more details on this idea)
How can I do this in Ruby? You'd think Github Flavored Markdown would be exactly what I need, but (surprisingly), it's quite buggy.
Here's their implementation:
# in very clear cases, let newlines become <br /> tags
text.gsub!(/^[\w\<][^\n]*\n+/) do |x|
x =~ /\n{2}/ ? x : (x.strip!; x << " \n")
end
This logic requires that the line start with a \w for a linebreak at the end to create a <br>. The reason for this requirement is that you don't to mess with lists: (But see the edit below; I'm not even sure this makes sense)
* we don't want a <br>
* between these two list items
However, the logic breaks in these cases:
[some](http://google.com)
[links](http://google.com)
*this line is in italics*
another line
> the start of a blockquote!
another line
I.e., in all of these cases there should be a <br> at the end of the first line, and yet GFM doesn't add one
Oddly, this works correctly in the javascript version of GFM.
Does anyone have a working implementation of "new lines to <br>s" in Ruby?
Edit: It gets even more confusing!
If you check out Github's official Github Flavored Markdown repository, you'll find yet another newline to <br> regex!:
# in very clear cases, let newlines become <br /> tags
text.gsub!(/(\A|^$\n)(^\w[^\n]*\n)(^\w[^\n]*$)+/m) do |x|
x.gsub(/^(.+)$/, "\\1 ")
end
I have no clue what this regex means, but it doesn't do any better on the above test cases.
Also, it doesn't look like the "don't mess with lists" justification for requiring that lines start with word characters is valid to begin with. I.e., standard markdown list semantics don't change regardless of whether you add 2 trailing spaces. Here:
item 1
item 2
item 3
In the source of this question there are 2 trailing spaces after "item 1", and yet if you look at the HTML, there is no superfluous <br>
This leads me to think the best regex for converting newlines to <br>s is just:
text.gsub!(/^[^\n]+\n+/) do |x|
x =~ /\n{2}/ ? x : (x.strip!; x << " \n")
end
Thoughts?
I'm not sure if this will help, but I just use simple_format()
from ActionView::Helpers::TextHelper
ActionView simple_format
my_text = "Here is some basic text...\n...with a line break."
simple_format(my_text)
output => "<p>Here is some basic text...\n<br />...with a line break.</p>"
Even if it doesn't meet your specs, looking at the simple_format() source code .gsub! methods might help you out writing your own version of required markdown.
A little too late, but perhaps useful for other people. I've gotten it to work (but not thoroughly tested) by preprocessing the text using regular expressions, like so. It's hideous as a result of the lack of zero-width lookbehinds, but oh well.
# Append two spaces to a simple line, if it ends in newline, to render the
# markdown properly. Note: do not do this for lists, instead insert two newlines. Also, leave double newlines
# alone.
text.gsub! /^ ([\*\+\-]\s+|\d+\s+)? (.+?) (\ \ )? \r?\n (\r?\n|[\*\+\-]\s+|\d+\s+)? /xi do
full, pre, line, spaces, post = $~.to_a
if post != "\n" && pre.blank? && post.blank? && spaces.blank?
"#{pre}#{line} \n#{post}"
elsif pre.present? || post.present?
"#{pre}#{line}\n\n#{post}"
else
full
end
end

How can I delete special characters?

I'm practicing with Ruby and regex to delete certain unwanted characters. For example:
input = input.gsub(/<\/?[^>]*>/, '')
and for special characters, example ☻ or ™:
input = input.gsub('&#', '')
This leaves only numbers, ok. But this only works if the user enters a special character as a code, like this:
™
My question:
How I can delete special characters if the user enters a special character without code, like this:
™ ☻
First of all, I think it might be easier to define what constitutes "correct input" and remove everything else. For example:
input = input.gsub(/[^0-9A-Za-z]/, '')
If that's not what you want (you want to support non-latin alphabets, etc.), then I think you should make a list of the glyphs you want to remove (like ™ or ☻), and remove them one-by-one, since it's hard to distinguish between a Chinese, Arabic, etc. character and a pictograph programmatically.
Finally, you might want to normalize your input by converting to or from HTML escape sequences.
If you just wanted ASCII characters, then you can use:
original = "aøbauhrhræoeuacå"
cleaned = ""
original.each_byte { |x| cleaned << x unless x > 127 }
cleaned # => "abauhrhroeuac"
You can use parameterize:
'#!#$%^&*()111'.parameterize
=> "111"
You can match all the characters you want, and then join them together, like this:
original = "aøbæcå"
stripped = original.scan(/[a-zA-Z]/).to_s
puts stripped
which outputs "abc"
An easier way to do this inspirated by Can Berk Güder answer is:
In order to delete special characters:
input = input.gsub(/\W/, '')
In order to keep word characters:
input = input.scan(/\w/)
At the end input is the same! Try it on : http://rubular.com/

Resources