How to have gsub handle multiple patterns and replacements - ruby-on-rails

A while ago I created a function in PHP to "twitterize" the text of tweets pulled via Twitter's API.
Here's what it looked like:
function twitterize($tweet){
    $patterns = array(
        "/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))?)/",
        "/(?<=^|(?<=[^a-zA-Z0-9-\.]))@([A-Za-z_]+[A-Za-z0-9_]+)/",
        "/(?<=^|(?<=[^a-zA-Z0-9-\.]))#([A-Za-z_]+[A-Za-z0-9_]+)/"
    );
    $replacements = array(
        "<a href='\\0' target='_blank'>\\0</a>",
        "<a href='http://twitter.com/\\1' target='_blank'>\\0</a>",
        "<a href='http://twitter.com/search?q=\\1&src=hash' target='_blank'>\\0</a>"
    );
    return preg_replace($patterns, $replacements, $tweet);
}
Now I'm a little stuck with Ruby's gsub, I tried:
def twitterize(text)
patterns = ["/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))?)/", "/(?<=^|(?<=[^a-zA-Z0-9-\.]))@([A-Za-z_]+[A-Za-z0-9_]+)/", "/(?<=^|(?<=[^a-zA-Z0-9-\.]))#([A-Za-z_]+[A-Za-z0-9_]+)/"]
replacements = ["<a href='\\0' target='_blank'>\\0</a>",
"<a href='http://twitter.com/\\1' target='_blank'>\\0</a>",
"<a href='http://twitter.com/search?q=\\1&src=hash' target='_blank'>\\0</a>"]
return text.gsub(patterns, replacements)
end
Which obviously didn't work and returned an error:
No implicit conversion of Array into String
And after looking at the Ruby documentation for gsub and exploring a few of the examples they were providing, I still couldn't find a solution to my problem: How can I have gsub handle multiple patterns and multiple replacements at once?

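Well, as you can read in the docs, gsub does not handle multiple patterns and replacements at once. That's what's causing your error, which is otherwise quite explicit (you can read it as "give me a String, not an Array!").
You can write it like this (note that the patterns are now Regexp literals rather than strings, because gsub treats a String pattern as literal text):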
def twitterize(text)
  patterns = [
    /((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))?)/,
    /(?<=^|(?<=[^a-zA-Z0-9-\.]))@([A-Za-z_]+[A-Za-z0-9_]+)/,
    /(?<=^|(?<=[^a-zA-Z0-9-\.]))#([A-Za-z_]+[A-Za-z0-9_]+)/
  ]
  replacements = [
    "<a href='\\0' target='_blank'>\\0</a>",
    "<a href='http://twitter.com/\\1' target='_blank'>\\0</a>",
    "<a href='http://twitter.com/search?q=\\1&src=hash' target='_blank'>\\0</a>"
  ]
  # Apply each pattern in turn; gsub! modifies the string passed in.
  patterns.each_with_index do |pattern, i|
    text.gsub!(pattern, replacements[i])
  end
  text
end
This can be refactored into more elegant rubyish code, but I think it'll do the job.
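One possible refactoring pairs each pattern with its replacement and folds the string through them with reduce, using the non-mutating gsub so the caller's string is left alone. This is only a sketch: the two rules are simplified, hypothetical stand-ins for the regexes in the question, and apply_rules is an illustrative name, not part of any library.

```ruby
# Hypothetical, simplified rules (not the regexes from the question).
# Each entry is [pattern, replacement]; \\0 is the whole match, \\1 the first group.
RULES = [
  [/@(\w+)/, "<a href='http://twitter.com/\\1'>\\0</a>"],
  [/#(\w+)/, "<a href='http://twitter.com/search?q=\\1&src=hash'>\\0</a>"]
].freeze

# Runs every rule over the text in order and returns a new string.
def apply_rules(text, rules = RULES)
  rules.reduce(text) do |result, (pattern, replacement)|
    result.gsub(pattern, replacement)
  end
end

puts apply_rules("hi @bob #ruby")
```

Because the rules run one after another, a later pattern sees the output of the earlier ones, so the order of the rules matters.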

The error occurred because you passed an array of replacements where gsub expects a string. Its syntax is:
text.gsub(matching_pattern, replacement_text)
You need to do something like this:
replaced_text = text.gsub(pattern1, replacement1)
replaced_text = replaced_text.gsub(pattern2, replacement2)
and so on, where each pattern is one of your matching patterns and each replacement is the text you want substituted for it.
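For the narrower case where every match is a fixed string (no capture groups needed), gsub also accepts a Hash as its second argument and looks up each matched text in it, so a single pass is enough. A minimal sketch, with a made-up lookup table named ENTITIES:

```ruby
# Hypothetical lookup table: each key is a literal match, each value its replacement.
ENTITIES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Regexp.union builds one alternation that matches any of the keys.
pattern = Regexp.union(ENTITIES.keys)

escaped = "a < b & c".gsub(pattern, ENTITIES)
puts escaped
```

With a Hash, the values are inserted as-is, so backreferences such as \\1 are not expanded; for those, chaining gsub calls as above is still needed.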

Related

ruby substring backslash plus "

So, I have a string like this:
str1 = "blablablabla... original_url=\"https://facebook.com/125642\"> ... blablablabla..."
what is the best approach to extract this original_url?
what I have done so far is this:
original_url = str1['content'][str1['content'].index('original_url')+12..str1['content'].index('>')-2]
It works, but it seems like a poor solution. Mostly I'm struggling to find this substring: \">
here's what I have tried so far
str1.index('\">')
str1.index('\\">') # escaping only one backslash
str1.index('\\\">') # escaping both the backslash and "
str1.index("\\\">") # was just without idea over here
I'm not a ruby programmer, so I'm kinda lost here
The best approach to parse XML namespaces is to use Nokogiri, as suggested by @spickermann.
Quick but not elegant and not even efficient solutions:
str1 = "blablablabla... original_url=\"https://facebook.com/125642\"> ... blablablabla..."
original_url=str1[str1.index("original_url")+14...str1.index("\">")]
# => "https://facebook.com/125642"
original_url=str1.split(/original_url=\"/)[1].split(/">/).first
# => "https://facebook.com/125642"
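Another option, sketched here under the assumption that the value never contains a double quote, is to let String#[] take a regex and a capture-group index, which avoids the index arithmetic:

```ruby
str1 = "blablablabla... original_url=\"https://facebook.com/125642\"> ... blablablabla..."

# Capture everything between the quotes that follow original_url=.
# Returns nil when the attribute is not present.
original_url = str1[/original_url="([^"]*)"/, 1]
puts original_url
```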

Multiple `gsub` in multiple `each` loops getting overridden by one another

Trying to iterate over some phrases, and whenever I find a word, I need to replace it with a link.
phrases = ["hello world", "worldwide"]
words_to_link = ["world", "world"]
I am trying to get:
"hello <a href='world'>world</a><br />worldwide"
My code is:
phrases.each do |ph|
words_to_link.each do |w|
ph.gsub!(w, "<a href='#{w}'>#{w}</a>")
end
end.join("<br />").html_safe
The output of this is:
"hello <a href='<a href='world'>world</a>'><a href='world'>world</a></a><br /><a href='<a href='world'>world</a>'><a href='world'>world</a></a>wide"
On the first run it finds all occurrences of world, but on the second, it goes inside the generated world and gsubs again.
Another problem is finding the proper regex to match only whole words. I thought it would be /\b(word)\b/, but that didn't work.
Any pointers?
I'm a little confused by your question, so may have got the wrong end of the stick here. However, here is an answer by my interpretation:
phrases = ["hello world", "worldwide"]
substitutions = { /\bworld\b/ => "world" }
phrases.each do |ph|
  substitutions.each do |pattern, replacement|
    ph.gsub!(pattern, "<a href='#{replacement}'>#{replacement}</a>")
  end
end
phrases.join("<br />").html_safe
You can use \b in a regex to mark a word boundary, which avoids altering the "worldwide" string. And (I think this is what you wanted?) you can define a mapping between the search and replace terms rather than looping through the words twice, which avoids the double replacement.
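If the list of words has to stay an array, one way to guarantee that generated markup is never rescanned is to combine the words into a single word-bounded pattern and call gsub once per phrase with a block. A rough sketch using the data from the question (html_safe is left out so it runs outside Rails):

```ruby
phrases = ["hello world", "worldwide"]
words_to_link = ["world", "world"]

# Build one alternation from the unique words, wrapped in word boundaries.
pattern = /\b(?:#{Regexp.union(words_to_link.uniq).source})\b/

# A single gsub pass per phrase, so inserted links are not matched again.
linked = phrases.map do |phrase|
  phrase.gsub(pattern) { |word| "<a href='#{word}'>#{word}</a>" }
end

html = linked.join("<br />")
puts html
```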

Regex in Ruby: expression not found

I'm having trouble with a regex in Ruby (on Rails). I'm relatively new to this.
The test string is:
http://www.xyz.com/017010830343?$ProdLarge$
I am trying to remove "$ProdLarge$". In other words, the $ signs and anything between.
My regular expression is:
\$\w+\$
Rubular says my expression is ok. http://rubular.com/r/NDDQxKVraK
But when I run my code, the app says it isn't finding a match. Code below:
some_array.each do |x|
logger.debug "scan #{x.scan('\$\w+\$')}"
logger.debug "String? #{x.instance_of?(String)}"
x.gsub!('\$\w+\$','scl=1')
...
My logger debug line shows a result of "[]". String is confirmed as being true. And the gsub line has no effect.
What do I need to correct?
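Use /regex/ instead of 'regex'. When gsub or scan is given a String, it searches for that literal text rather than treating it as a pattern: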
> "http://www.xyz.com/017010830343?$ProdLarge$".gsub(/\$\w+\$/, 'scl=1')
=> "http://www.xyz.com/017010830343?scl=1"
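The difference can be seen side by side in the short sketch below. The variable names are only illustrative; Regexp.new is shown for the case where the pattern really does arrive as a string.

```ruby
url = 'http://www.xyz.com/017010830343?$ProdLarge$'

# String pattern: looks for the literal characters \$\w+\$, which do not occur.
literal_result = url.gsub('\$\w+\$', 'scl=1')

# Regexp pattern: matches a dollar sign, word characters, then a dollar sign.
regex_result = url.gsub(/\$\w+\$/, 'scl=1')

# A pattern held in a string can be compiled with Regexp.new first.
compiled_result = url.gsub(Regexp.new('\$\w+\$'), 'scl=1')

puts literal_result
puts regex_result
puts compiled_result
```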
Don't use a regex for this task; use a tool designed for it, URI. To remove the query:
require 'uri'
url = URI.parse('http://www.xyz.com/017010830343?$ProdLarge$')
url.query = nil
puts url.to_s
=> http://www.xyz.com/017010830343
To change to a different query use this instead of url.query = nil:
url.query = 'scl=1'
puts url.to_s
=> http://www.xyz.com/017010830343?scl=1
URI will automatically encode values if necessary, saving you the trouble. If you need even more URL management power, look at Addressable::URI.

Ruby gsub function

I'm trying to create a BBcode [code] tag for my rails forum, and I have a problem with the expression:
param_string.gsub!( /\[code\](.*?)\[\/code\]/im, '<pre>\1</pre>' )
How do I get what the regex match returns (the text in between the [code][/code] tags), and escape all the HTML and some other characters in it?
I've tried this:
param_string.gsub!( /\[code\](.*?)\[\/code\]/im, '<pre>' + my_escape_function('\1') + '</pre>' )
but it didn't work. It just passes "\1" as a string to the function.
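The function receives the literal string '\1' because Ruby evaluates the arguments before gsub! runs, so no match exists yet. The block form of gsub is evaluated once per match instead. A minimal sketch, assuming ERB::Util.html_escape from the standard library stands in for my_escape_function, and with render_code_tags as a made-up name:

```ruby
require 'erb'

# Replaces [code]...[/code] with <pre>...</pre>, escaping the enclosed text.
# The block runs for each match, so $1 holds that match's first capture group.
def render_code_tags(text)
  text.gsub(/\[code\](.*?)\[\/code\]/im) do
    "<pre>#{ERB::Util.html_escape($1)}</pre>"
  end
end

puts render_code_tags("x [code]<b>hi</b>[/code] y")
```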
You should take care of the greedy behavior of the regular expressions. So the correct code looks like this:
html.gsub!(/\[(\S*?)\](.*?)\[\/\1\]/) { |m| escape_method($1, $2) }
The escape_method then looks like this:
def escape_method( type, string )
  case type.downcase
  when 'code'
    "<pre>#{string}</pre>"
  when 'bold'
    "<b>#{string}</b>"
  else
    string
  end
end
Someone here posted an answer, but they've deleted it.
I've tried their suggestion, and made it work with a small change. Whoever you are, thanks! :)
Here it is
param_string.gsub!( /\[code\](.*?)\[\/code\]/im ) { '<pre>' + my_escape_function($1) + '</pre>' }
Inside a gsub block, you can simply use "<pre>#{$1}</pre>" for your replacement value.

Truncate Markdown?

I have a Rails site, where the content is written in markdown. I wish to display a snippet of each, with a "Read more.." link.
How do I go about this? Simply truncating the raw text will not work, for example..
>> "This is an [example](http://example.com)"[0..25]
=> "This is an [example](http:"
Ideally I want to allow the author to (optionally) insert a marker to specify what to use as the "snippet", if not it would take 250 words, and append "..." - for example..
This article is an example of something or other.
This segment will be used as the snippet on the index page.
^^^^^^^^^^^^^^^
This text will be visible once clicking the "Read more.." link
The marker could be thought of like an EOF marker (which can be ignored when displaying the full document)
I am using maruku for the Markdown processing (RedCloth is very biased towards Textile, BlueCloth is extremely buggy, and I wanted a native-Ruby parser which ruled out peg-markdown and RDiscount)
Alternatively (since the Markdown is translated to HTML anyway) truncating the HTML correctly would be an option - although it would be preferable to not markdown() the entire document, just to get the first few lines.
So, the options I can think of are (in order of preference)..
Add a "truncate" option to the maruku parser, which will only parse the first x words, or till the "excerpt" marker.
Write/find a parser-agnostic Markdown truncate'r
Write/find an intelligent HTML truncating function
Write/find an intelligent HTML truncating function
The following from http://mikeburnscoder.wordpress.com/2006/11/11/truncating-html-in-ruby/, with some modifications will correctly truncate HTML, and easily allow appending a string before the closing tags.
>> puts "<p><b>Something</p>".truncate_html(5, at_end = "...")
=> <p><b>Someth...</b></p>
The modified code:
require 'rexml/parsers/pullparser'

class String
  # Truncate HTML to roughly `len` characters of text, closing any tags left open.
  def truncate_html(len = 30, at_end = nil)
    p = REXML::Parsers::PullParser.new(self)
    tags = []
    new_len = len
    results = ''
    while p.has_next? && new_len > 0
      p_e = p.pull
      case p_e.event_type
      when :start_element
        tags.push p_e[0]
        results << "<#{tags.last}#{attrs_to_s(p_e[1])}>"
      when :end_element
        results << "</#{tags.pop}>"
      when :text
        results << p_e[0][0..new_len]
        new_len -= p_e[0].length
      else
        results << "<!-- #{p_e.inspect} -->"
      end
    end
    # Append the caller-supplied suffix (e.g. "...") before closing open tags.
    results << at_end if at_end
    tags.reverse.each do |tag|
      results << "</#{tag}>"
    end
    results
  end

  private

  def attrs_to_s(attrs)
    if attrs.empty?
      ''
    else
      ' ' + attrs.to_a.map { |attr| %{#{attr[0]}="#{attr[1]}"} }.join(' ')
    end
  end
end
Here's a solution that works for me with Textile.
Convert it to HTML
Truncate it.
Remove any HTML tags that got cut in half with
html_string.gsub(/<[^>]*$/, "")
Then use Hpricot to clean it up and close unclosed tags
html_string = Hpricot( html_string ).to_s
I do this in a helper, and with caching there's no performance issue.
You could use a regular expression to find a line consisting of nothing but "^" characters:
markdown_string = <<-eos
This article is an example of something or other.
This segment will be used as the snippet on the index page.
^^^^^^^^^^^^^^^
This text will be visible once clicking the "Read more.." link
eos
preview = markdown_string[0...(markdown_string =~ /^\^+$/)]
puts preview
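Combining the marker lookup with the word-count fallback described in the question could look roughly like this. The method name snippet and its keyword arguments are made up for illustration, and the fallback collapses whitespace, which may or may not be acceptable for Markdown input.

```ruby
# Returns the text before a line of "^" characters if there is one;
# otherwise the first max_words words followed by "...".
def snippet(markdown, max_words: 250, marker: /^\^+$/)
  marker_at = markdown =~ marker
  return markdown[0...marker_at].rstrip if marker_at

  words = markdown.split
  return markdown if words.length <= max_words

  # Joining with single spaces drops the original line breaks.
  words.first(max_words).join(" ") + "..."
end

puts snippet("Intro text.\n^^^^\nRest of the article.")
puts snippet("one two three four five", max_words: 3)
```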
Rather than trying to truncate the text, why not have two input boxes, one for the "opening blurb" and one for the main "guts"? That way your authors will know exactly what is being shown where, without having to rely on some sort of funky EOF marker.
I have to agree with the "two inputs" approach. The content writer need not worry, since you can modify the background logic to combine the two inputs into one when showing the full content.
full_content = input1 + input2 # perhaps with some complementary HTML, for better formatting
Not sure if it applies to this case, but adding the solution below for the sake of completeness. You can use strip_tags method if you are truncating Markdown-rendered contents:
truncate(strip_tags(markdown(article.contents)), length: 50)
Sourced from:
http://devblog.boonecommunitynetwork.com/rails-and-markdown/
A simpler option that just works:
truncate(markdown(item.description), length: 100, escape: false)

Resources