Regular expression to remove only beginning and end html tags from string? - ruby-on-rails

I would like to remove for example <div><p> and </p></div> from the string below. The regex should be able to remove an arbitrary number of tags from the beginning and end of the string.
<div><p>text to <span class="test">test</span> the selection on.
Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>
I have been tinkering with rubular.com without success. Thanks!

def remove_html_end_tags(html_str)
html_str.match(/\<(.+)\>(?!\W*\<)(.+)\<\/\1\>/m)[2]
end
I'm not seeing the problem of \<(.+)> consuming multiple opening tags that Alan Moore pointed out below, which is odd because I agree it's incorrect. It should be changed to \<([^>\<]+)> or something similar to disambiguate.
def remove_html_end_tags(html_str)
html_str.match(/\<([^\>\<]+)\>(?!\W*?\<)(.+)\<\/\1\>/m)[2]
end
The idea is that you want to capture everything between the open/close of the first tag encountered that is not followed immediately by another tag, even with spaces between.
Since I wasn't sure how to say, with a positive lookahead, "give me the first tag whose closing angle bracket is followed by at least one word character before the next opening angle bracket", I used a negative lookahead instead:
\>(?!\W*\<)
That is: find a closing angle bracket that is not followed by only non-word characters up to the next opening angle bracket.
Once you've identified the tag with that property, find its closing mate and return the stuff between.
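As a rough illustration of the lookahead idea above, here is a minimal sketch applied to the sample string from the question. The name inner_content and the nil-safe fallback are assumptions added for the example, not part of the original answer. Note that the \1 backreference only finds the closing mate when the opening tag has no attributes.

```ruby
# Hypothetical nil-safe wrapper around the lookahead pattern discussed above.
# Group 1: the first tag NOT immediately followed by another tag.
# Group 2: everything up to that tag's last closing mate.
INNER_CONTENT = /<([^<>]+)>(?!\W*<)(.+)<\/\1>/m

def inner_content(html_str)
  match = html_str.match(INNER_CONTENT)
  # Fall back to the input when nothing matches instead of raising on nil[2].
  match ? match[2] : html_str
end

sample = "<div><p>text to <span class=\"test\">test</span> the selection on.\n" \
         "Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>"
puts inner_content(sample)
```

The fallback is a design choice for the sketch; returning nil or raising may suit other callers better.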
Here's another approach. Find tags scanning forward and remove the first n. Would blow up with nested tags of the same type, but I wouldn't take this approach for any real work.
def remove_first_n_html_tags(html_str, skip_count=0)
matches = []
tags = html_str.scan(/\<([\w\s\_\-\d\"\'\=]+)\>/).flatten
tags.each do |tag|
close_tag = "\/%s" % tag.split(/\s+/).first
match_str = "<#{tag}>(.+)<#{close_tag}>"
match = html_str.match(/#{match_str}/m)
matches << match if match
end
matches[skip_count]
end

Still involves some programming:
str = '<div><p>text to <span class="test">test</span> the selection on.
Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>'
while (m = /\A<.+?>/.match(str)) && str.end_with?('</' + m[0][1..-1])
str = str[m[0].size..-(m[0].size + 2)]
end
Cthulhu you out there?

I am going to go ahead and answer my own question. Below is the programmatic route:
The input string goes into the first loop as an array in order to remove the front tags. The resulting string is looped through in reverse order in order to remove the end tags. The string is then reversed in order to put it in the correct order.
def remove_html_end_tags(html_str)
str_no_start_tag = ''
str_no_start_and_end_tag = ''
a = html_str.split("")
i= 0
is_text = false
while i <= (a.length - 1)
if (a[i] == '<') && !is_text
while (a[i] != '>')
i+= 1
end
i+=1
else
is_text = true
str_no_start_tag << a[i]
i+=1
end
end
a = str_no_start_tag.split("")
i= a.length - 1
is_text = false
while i >= 0
if (a[i] == '>') && !is_text
while (a[i] != '<')
i-= 1
end
i-=1
else
is_text = true
str_no_start_and_end_tag << a[i]
i-=1
end
end
str_no_start_and_end_tag.reverse!
end
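The two passes described above (drop the leading tags, then drop the trailing tags) can also be sketched with two anchored substitutions. This is only an approximation of the loop version: strip_outer_tags is a hypothetical name, and the sketch assumes tags never contain a literal ">" inside an attribute value.

```ruby
# Hypothetical helper: remove any run of tags at the very start and at the
# very end of the string, leaving tags in the middle untouched.
def strip_outer_tags(html_str)
  html_str
    .sub(/\A(?:\s*<[^>]+>)+/, "")   # leading tags, anchored at the start
    .sub(/(?:<[^>]+>\s*)+\z/, "")   # trailing tags, anchored at the end
end

sample = "<div><p>text to <span class=\"test\">test</span> the selection on.\n" \
         "Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>"
puts strip_outer_tags(sample)
```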

(?:\<div.*?\>\<p.*?\>)|(?:\<\/p\>\<\/div\>) is the expression you need. But this doesn't check for every scenario... if you are trying to parse any possible combination of tags, you may want to look at other ways to parse.
For example, this expression doesn't allow for any whitespace between the div and p tags. If you wanted to allow for that, you would add \s* in between the \>\< sections of the tags, like so: (?:\<div.*?\>\s*\<p.*?\>)|(?:\<\/p\>\s*\<\/div\>).
The div tag and the p tag are expected to be lowercase, as the expression is written. So you may want to figure out a way to check for upper or lower case letters for each, so that Div or dIV would be found too.
Use gskinner's RegEx tool for testing and learning Regular Expressions.
So your end ruby code should look something like this:
# Ruby sample for showing the use of regular expressions
str = "<div><p>text to <span class=\"test\">test</span> the selection on.
Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>"
puts 'Before Regular Expression: "', str, '"'
str.gsub!(/(?:\<div.*?\>\s*\<p.*?\>)|(?:\<\/p\>\s*\<\/div\>)/, "")
puts 'After Regular Expression', str
system("pause")
EDIT: Replaced div*? to div.*? and replaced p*? to p.*? per suggestions in the comments.
EDIT: This answer doesn't allow for any set of tags, just the two listed in the first line of the question.
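On the upper/lower case point raised above, one possible sketch is to add the /i flag. The \A and \z anchors and the name strip_div_p are assumptions added here so that only the outermost pair is touched; they are not in the original expression. As written, <p[^>]*> would also match tags such as <pre>, so treat this as illustrative only.

```ruby
# Hypothetical variant of the div/p expression: case-insensitive (/i) and
# anchored to the start (\A) and end (\z) of the string.
OUTER_DIV_P = /\A<div[^>]*>\s*<p[^>]*>|<\/p>\s*<\/div>\z/i

def strip_div_p(str)
  str.gsub(OUTER_DIV_P, "")
end

puts strip_div_p("<DIV>\n<P class=\"x\">Hello <b>all</b></P></Div>")
```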

Related

simpler way to modify a string

I recently solved this problem, but felt there is a simpler way to do it. I'd like to use fewer lines of code than I am now. I'm new to ruby so if the answer is simple I'd love to add it to my toolbag. Thank you in advance.
Goal: accept a word as an argument and return the word with its last vowel removed. If there are no vowels, return the original word.
def hipsterfy(word)
vowels = "aeiou"
i = word.length - 1
while i >= 0
if vowels.include?(word[i])
return word[0...i] + word[i+1..-1]
end
i -= 1
end
word
end
try this regex magic:
def hipsterfy(word)
word.gsub(/[aeiou](?=[^aeiou]*$)/, "")
end
how does it work?
[aeiou] matches a vowel, and the lookahead (?=[^aeiou]*$) adds the constraint "only if no other vowel follows before the end of the string". So the regex matches only the last vowel, and gsub then replaces that match with "".
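To see the lookahead at work, here is a small sketch with a few sample words. The name remove_last_vowel is made up for the example; sub is used instead of gsub because the pattern can match at most once, and \z is used so the end-of-string anchor does not depend on line breaks.

```ruby
# A vowel followed only by non-vowels up to the end of the string,
# i.e. the last vowel in the word.
LAST_VOWEL = /[aeiou](?=[^aeiou]*\z)/

def remove_last_vowel(word)
  word.sub(LAST_VOWEL, "")
end

%w[burrito misty tshirt zzz].each do |w|
  puts "#{w} -> #{remove_last_vowel(w)}"
end
```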
You could use rindex to find the last vowel's index and []= to remove the corresponding character:
def hipsterfy(word)
idx = word.rindex(/[aeiou]/)
word[idx] = '' if idx
word
end
The if idx is needed because rindex returns nil if no vowel is found. Note that []= modifies word.
There's also rpartition, which splits the string at the given pattern, returning an array containing the part before, the match and the part after. By concatenating the former and latter, you can effectively remove the middle part (i.e. the vowel):
def hipsterfy(word)
before, _, after = word.rpartition(/[aeiou]/)
before.concat(after)
end
This variant returns a new string, leaving word unchanged.
Another common approach when dealing with some last occurrence is to reverse the string so you can deal with a first occurrence instead (which is usually simpler). Here, you can utilize sub:
def hipsterfy(word)
word.reverse.sub(/[aeiou]/, '').reverse
end
Here is another way to do it.
Reverse the characters of the string
Use find_index to get the first vowel location in this reversed string
Delete the character at this index
Un-reverse the characters and join them back together.
reverse_chars = str.chars.reverse
vowel_idx = reverse_chars.find_index { |char| char =~ /[aeiou]/ }
reverse_chars.delete_at(vowel_idx) if vowel_idx
result = reverse_chars.reverse.join

ruby on rails regular expressions

In my Rails application I have a generic search that displays matching results. To produce the matches, I replace blank spaces with the "%" symbol. It works perfectly, but only if there is a space in the search term. If I enter a single word, it says "no matching string".
class TweetsController<ApplicationController
def index
city = params[:show]
search_term = params[:text]
search_term[" "] = "%"
city_coordinates = Coordinates.where('city=?', city)
@tweets = if (city_coordinates.count == 1 && city_coordinates.first.valid_location?)
Tweets.for_coordinates(city_coordinates.first) & Tweets.where("tweet_text LIKE?" ,"%#{search_term}%").all
else if (Coordinates.count != 1 )
Tweets.for_user_location(city) & Tweets.where("tweet_text LIKE ?" , "%#{search_term}%").all
else
@tweets = Tweets.where("%tweet_text% LIKE ? ", "%#{search_term}%").all
end
end
end
end
I only get output if I type two words, like "Harbhajan Singh" or "VVS Laxman". If I type a single word, it says there are no matching strings. I need results whether the user enters one word, two words, or more. Can anybody help me with this?
Probably, you are getting an
IndexError: string not matched
That's because when there is a single word coming in params[:text], this code
search_term[" "] = "%"
raises the error.
You might want to read the string documentation for more details. It states:
If the regular expression or string is used as the index doesn’t match a position in the string, IndexError is raised.
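One way around the exception, sketched under the assumption that the goal is simply to put "%" between words: gsub does nothing when there is no space to replace, so single-word input passes through. The helper name like_pattern is hypothetical, and the sketch does not escape "%" or "_" characters that the user might type.

```ruby
# Hypothetical helper: turn the raw search text into a LIKE pattern.
def like_pattern(search_text)
  # gsub replaces every run of whitespace and is a no-op when there is none,
  # unlike search_term[" "] = "%", which raises IndexError on a single word.
  "%#{search_text.strip.gsub(/\s+/, '%')}%"
end

puts like_pattern("Harbhajan Singh")  # two words
puts like_pattern("Laxman")           # single word, no exception
```

In the controller from the question, the result could then be passed as the bound value, for example Tweets.where("tweet_text LIKE ?", like_pattern(params[:text])).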
Hope this helps.
I'm not too great with regular expressions myself, so I usually turn to Rubular. It helps you build and test regular expressions for Ruby.

Get position of found string using Regexp?

I have a big text and I'd like to remove everything before a certain string.
The problem is, there are several occurrences of that string in the text, and I want to decide which one is correct by later analyzing the found piece of text.
I can't include that analysis in a regular expression because of its complexity:
text = <<HERE
big big text
goes here
HERE
pos = -1
a = text.scan(/some regexp/im)
a.each do |m|
s = m[0]
# analysis of found string
...
if ( s is good ) # is the right candidate
pos = ??? # here I'd like to have a position of the found string in the text.
end
end
result_text = text[pos..-1]
$~.offset(n) will give the position of the n-th part of a match.
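A minimal sketch of how that might fit the loop in the question: when scan is given a block, the match data for the current match is available inside it (here via Regexp.last_match, the same object as $~). The method name cut_before_match, the MARK marker, and the "second" check are placeholders standing in for the real pattern and analysis.

```ruby
# Hypothetical helper: return the text starting at the first match for which
# the caller's block (the "analysis") returns true.
def cut_before_match(text, pattern)
  pos = nil
  text.scan(pattern) do
    match = Regexp.last_match      # MatchData for the current match
    if yield(match)                # caller decides whether this one is right
      pos = match.begin(0)         # same as match.offset(0).first
      break
    end
  end
  pos ? text[pos..-1] : text       # leave the text alone if nothing qualified
end

text = "intro MARK first part MARK second part MARK third part"
puts cut_before_match(text, /MARK (\w+)/) { |m| m[1] == "second" }
```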
I think you should count how many occurrences there are in your big string then use index to cut off all the occurrences that do not match the final pattern.

Adding a if statement based on number of words in a string

I managed to cobble together this statement based on lots of help and copying and pasting. It basically returns the first x words of a string, and I'm using it as a helper in my app.
Could someone please help me understand how I would add a condition so that, if the actual string has fewer than x words, the finishing bit (which is a ...) is not added? In the method below, I'd like the 'finish' section to be added only if there are more words than the number passed in.
def first_x_words(str,n=20,finish='…')
str.split(' ')[0,n].inject{|sum,word| sum + ' ' + word} + finish
end
Actually - if I could make it more complicated, is it possible, after I find a condition where there are less than x words, to check to see if the last 4 characters are </p> and if they are, remove them.
Thanks,
Adam
This should do what you're looking for:
def first_x_words(str, n = 20, finish = '…')
# By default, Ruby will split on whitespace, so no
# argument needs to be passed.
words = str.split
# Rebuild 'n' words into a new string.
truncated = words[0..n-1].inject do |sum, word|
sum << ' ' << word
end
# Either append a finishing string or remove any
# trailing '</p>' tag.
if words.length > n
truncated << finish
else
truncated.chomp!("</p>")
end
# Return the completed string.
truncated
end
It's messy but if you really want to do it
def first_x_words(str,n=20,finish='…')
# make finish blank if the text is short enough
finish = '' if str.split(' ').count <= n
result = str.split(' ')[0,n].inject{|sum,word| sum + ' ' + word} + finish
# remove trailing </p> if any, and return the truncated text
result.chomp('</p>')
end
I added one line of code before and after your original code so you can hopefully understand it better.
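For comparison, a more compact sketch of the same behaviour using join instead of inject. The name truncate_words is made up for the example, and it assumes that collapsing runs of whitespace into single spaces is acceptable.

```ruby
# Hypothetical helper: first n words plus a finishing string, or the whole
# text with a trailing </p> removed when it has n words or fewer.
def truncate_words(str, n = 20, finish = "…")
  words = str.split
  if words.length > n
    words.first(n).join(" ") + finish
  else
    words.join(" ").chomp("</p>")
  end
end

puts truncate_words("one two three four", 2)
puts truncate_words("one two</p>", 5)
```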

counting line numbers of a poem with nokogiri / ruby

I've been struggling to do this with a simple regex, but it's never been very accurate. It doesn't have to be perfect.
The source has a combination of <p> and <br> tags. I don't want to count blank lines.
Old way:
self.words = rendered.gsub(/<p> <\/p>/,'').gsub(/<p><br\s?\/?>|(?:<br\s?\/?>){2,}/,'<br>').scan(/<br>|<br \/>|<p/).size+1
New way (not working):
Tries to turn all the doubled <br> tags into paragraphs, then throws the result into Nokogiri to count paragraph tags with more than 3 characters in them (I have no idea how. Counting 1-letter lines would be nice too, but this worked OK in JavaScript).
h = rendered
h.gsub!(/<br>\s*<br>/gi,"<p>")
h.gsub!(/<br>/gi,"<p>") if h =~ /<br>\s*<br>/
h.prepend "<p>" if !h =~ /^\s*<p[^>]*>/i
h.replace(/<p>\s*<p>/g,"<p> </p><p>")
Nokogiri::HTML(rendered)
# find+count p tags with at least 1-3 chars?
# this is javascript not ruby, but you get the idea
$('p', c).each(function(i) { // had to trim it to remove whitespaces from start/end.
if ($(this).children('img').length) return; // skip if it's just an image.
if ($.trim($(this).text()).length > 3)
$(this).append("<div class='num'>"+ (n += 1) +"</div>");
})
Other methods are welcome!
Example poem ( http://allpoetry.com/poem/7429983-the_many_endings-by-Kevin )
<p>
from the other side of silence<br>
you met me with change and a pocket<br>
of unhappy apples.</p>
<p>
</p>
<p>
<br>
we bled together to black<br>
and chose the path carefully to<br>
france.<br><br>
sometimes when you smile<br>
your radiant footsteps fall<br>
and all around us is silence:<br>
each dream step is<br>
false but full of such glory</p>
<p>
</p>
<p>
<br>
unhappiness never made a student of you:<br>
just two by two by two. now three<br>
this great we that overflows our<br>
heart-cave<br><br>
each jewel-like addition to the delicate<br>
crown. but flowers fall and dreams,<br>
all dreams, come to and end with death.</p>
Thank you!
For posterity, here's what I'm using now and it seems to be quite accurate. Non latin chars cause some problems sometimes from ckeditor, so I'm stripping them out for now.
html = Nokogiri::HTML(rendered)
text = html.at('body').inner_text rescue nil
return self.words = rendered.gsub(/<p> <\/p>/,'').gsub(/<p><br\s?\/?>|(?:<br\s?\/?>){2,}/,'<br>').scan(/<br>|<br \/>|<p/).size+1 if !text
#bonus points to strip lines entirely non-letter. idk
#d "text is", text.gsub!(/([\x09|\x0D|\t])|(\xc2\xa0){1,}|[^A-z]/u,'')
text.gsub!(/[^A-z\n]/u,'')
#d "text is", text
self.words = text.strip.scan(/(\s*\n\s*)+/).size+1
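As one more alternative, here is a plain-Ruby sketch that does not use Nokogiri: it turns <br> and <p> boundaries into newlines, drops any other tags, and counts the lines that contain at least one letter. This is a different technique from the code above, the name count_poem_lines is hypothetical, and only the &nbsp; entity is handled.

```ruby
# Hypothetical helper: count non-blank poem lines in an HTML fragment.
def count_poem_lines(html)
  text = html.gsub(/&nbsp;/i, " ")                          # entity -> space
             .gsub(/<br\s*\/?>|<\/?p\b[^>]*>/i, "\n")       # line boundaries
             .gsub(/<[^>]+>/, "")                           # drop other tags
  # A line counts only if it contains at least one letter.
  text.each_line.count { |line| line.match?(/[[:alpha:]]/) }
end

html = "<p>\nfirst line<br>\nsecond line</p>\n<p>\n&nbsp;</p>\n" \
       "<p>\n<br>\nthird<br><br>\nfourth</p>"
puts count_poem_lines(html)
```

Using [[:alpha:]] rather than [A-z] means lines in non-Latin scripts are counted as well, which may or may not be what you want.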
