Simple regex problem in Rails - ruby-on-rails

Ok. It's late and I'm tired.
I want to match a character in a string. Specifically, the appearance of 'a'. As in "one and a half".
If I have a string which is all lowercase.
"one and a half is always good" # what a dumb example. No idea how I thought of that.
and I call titleize on it
"one and a half is always good".titleize #=> "One And A Half Is Always Good"
This is wrong because the 'And' and the 'A' should be lowercase. Obviously.
So, I can do
"One and a Half Is always Good".titleize.tr('And', 'and') #=> "One and a Half Is always Good"
My question: how do I make the "A" an "a" without making the "Always" into "always"?

This does it:
require 'active_support/all'
str = "one and a half is always good" #=> "one and a half is always good"
str.titleize.gsub(%r{\b(A|And|Is)\b}i){ |w| w.downcase } #=> "One and a Half is Always Good"
or
str.titleize.gsub(%r{\b(A(nd)?|Is)\b}i){ |w| w.downcase } #=> "One and a Half is Always Good"
Take your pick of either of the last two lines. The regex pattern could be created elsewhere and passed in as a variable, for maintenance or code cleanliness.
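As a rough sketch of keeping the pattern elsewhere and passing it in: the constant names, the helper name, and the word list below are all hypothetical, and the plain-Ruby capitalize step only stands in for ActiveSupport's titleize so the snippet runs without Rails.

```ruby
# Hypothetical word list; extend it to suit your own style guide.
SMALL_WORDS = %w[a an and is].freeze

# Build the pattern once. Regexp.union escapes each word, and the
# non-capturing group keeps both \b anchors applied to every alternative.
SMALL_WORDS_PATTERN = /\b(?:#{Regexp.union(SMALL_WORDS).source})\b/i

# Downcase every whole word in +title+ that matches +pattern+.
def downcase_small_words(title, pattern = SMALL_WORDS_PATTERN)
  title.gsub(pattern) { |word| word.downcase }
end

# Plain-Ruby stand-in for ActiveSupport's titleize (an assumption made
# so the sketch has no dependencies).
titleized = "one and a half is always good".split.map(&:capitalize).join(' ')

result = downcase_small_words(titleized)
puts result # prints "One and a Half is Always Good"
```

Note that this sketch, like the one-liners above, also downcases a small word that happens to be the first word of the title.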

I like Greg's two-liner (first titleize, then use a regex to downcase selected words.) FWIW, here's a function I use in my projects. Well tested, although much more verbose. You'll note that I'm overriding titleize in ActiveSupport:
class String
  #
  # A better titleize that creates a usable
  # title according to English grammar rules.
  #
  def titleize
    count = 0
    result = []
    for w in self.downcase.split
      count += 1
      if count == 1
        # Always capitalize the first word.
        result << w.capitalize
      else
        unless ['a','an','and','by','for','in','is','of','not','on','or','over','the','to','under'].include? w
          result << w.capitalize
        else
          result << w
        end
      end
    end
    return result.join(' ')
  end
end

Related

How to set an IF statement comparing to values inside an array? (Ruby)

Is it possible to set a conditional statement (IF statement) comparing a variable against a variable that iterates through the values inside an array? I was looking for something like:
array_of_small_words = ["and","or","be","the","of","to","in"]
if word == array_of_small_words.each
# do thing
else
# do another thing
end
Basically, I want to capitalize each word but don't want to do it for "small words". I know I could do the opposite and iterate through the array first and then compare each iteration with the word, but I was hoping there would be a more efficient way.
sentence = ["this","is","a","sample","of","a","title"]
array_of_small_words = ["and","or","be","the","of","to","in"]
sentence.each do |word|
array_of_small_words.each do |small_words|
if word == small_words
# don't capitalize word
else
# capitalize word
end
end
end
I'm not really sure if this is possible or if there is a better way of doing this?
Thank you!
sentence = ["this","is","a","sample","of","a","title"]
array_of_small_words = ["and","or","be","the","of","to","in"]
sentence.map do |word|
array_of_small_words.include?(word) ? word : word.upcase
end
#⇒ ["THIS", "IS", "A", "SAMPLE", "of", "A", "TITLE"]
What you're looking for is if array_of_small_words.include?(word).
This should be faster than @mudasobwa's repeated use of include? if packaged in a method and used frequently. It would not be faster, however, if mudsie used a set lookup (a minor change, of which he is well aware), as I mentioned in a comment. If efficiency is important, I'd prefer mudsie's way with the set mod over my answer. In a way I was just playing around below.
I've assumed the small words are and, or, be, the, of, to, in and notwithstanding.
SMALL_WORDS = %w| and or be the of to in notwithstanding |
#=> ["and", "or", "be", "the", "of", "to", "in", "notwithstanding"]
(SMALL_WORDS_HASH = SMALL_WORDS.map { |w| [w.upcase, w] }.to_h).
default_proc = proc { |h,k| h[k]=k }
Test:
SMALL_WORDS_HASH
#=> {"AND"=>"and", "OR"=>"or", "BE"=>"be", "THE"=>"the", "OF"=>"of",
# "TO"=>"to", "IN"=>"in", "NOTWITHSTANDING"=>"notwithstanding"}
SMALL_WORDS_HASH["TO"]
#=> "to"
SMALL_WORDS_HASH["HIPPO"]
#=> "HIPPO"
def convert(arr)
arr.join(' ').upcase.gsub(/\w+/, SMALL_WORDS_HASH)
end
convert ["this","is","a","sample","of","a","title"]
#=> "THIS IS A SAMPLE of A TITLE"
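The set lookup mentioned above might look something like the following. This is only a sketch: the constant and method names are made up for illustration, and it capitalizes (rather than upcases) the remaining words, as the question asked.

```ruby
require 'set'

# Hypothetical constant holding the question's small words.
SMALL_WORD_SET = Set.new(%w[and or be the of to in]).freeze

# Capitalize every word except those found in +small_words+.
# Set#include? is a hash lookup, so it does not scan the whole list.
def capitalize_unless_small(words, small_words = SMALL_WORD_SET)
  words.map { |word| small_words.include?(word) ? word : word.capitalize }
end

capitalized = capitalize_unless_small(%w[this is a sample of a title])
p capitalized # prints ["This", "Is", "A", "Sample", "of", "A", "Title"]
```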

Ruby regex punctuation

I am having trouble writing this so that it will take a sentence as an argument and perform the translation on each word without affecting the punctuation.
I'd also like to continue using the partition method.
It would be nice if I could have it keep a quote together as well, such as:
"I said this", I said.
would be:
"I aidsay histay", I said.
def convert_sentence_pig_latin(sentence)
p split_sentence = sentence.split(/\W/)
pig_latin_sentence = []
split_sentence.each do |word|
if word.match(/^[^aeiou]+/x)
pig_latin_sentence << word.partition(/^[^aeiou]+/x)[2] + word.partition(/^[^aeiou]+/x)[1] + "ay"
else
pig_latin_sentence << word
end
end
rejoined_pig_sentence = pig_latin_sentence.join(" ").downcase + "."
p rejoined_pig_sentence.capitalize
end
convert_sentence_pig_latin("Mary had a little lamb.")
Your main problem is that [^aeiou] matches every character that is not a lowercase vowel, including spaces, commas, quotation marks, etc.
If I were you, I'd use a positive match for consonants, i.e. [b-df-hj-np-tv-z]. I would also put that regex in a variable, so you don't have to repeat it three times.
Also, in case you're interested, there's a way to make your convert_sentence_pig_latin method a single gsub and it will do the whole sentence in one pass.
Update
...because you asked...
sentence.gsub( /\b([b-df-hj-np-tv-z])(\w+)/i ) { "#{$2}#{$1}ay" }
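A slightly extended sketch of the same idea, keeping the pattern in a constant and moving the whole leading consonant cluster instead of a single letter. The constant and method names are hypothetical, and treating "y" as a consonant simply follows the character class above.

```ruby
# One or more leading consonants, followed by the rest of the word.
LEADING_CONSONANTS = /\b([b-df-hj-np-tv-z]+)(\w+)/i

# Translate each word that starts with a consonant; punctuation and
# spacing are never matched, so they pass through untouched.
def pig_latin(sentence)
  sentence.gsub(LEADING_CONSONANTS) { "#{$2}#{$1}ay" }
end

translated = pig_latin("this string, with punctuation!")
puts translated # prints "isthay ingstray, ithway unctuationpay!"
```

Words that begin with a vowel are left alone, and the original letter case is not adjusted after the move.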
# iterate over and replace regexp matches using gsub
def convert_sentence_pig_latin2(sentence)
r = /^[^aeiou]+/i
sentence.gsub(/"([^"]*)"/m) {|x| x.gsub(/\w+/) {|y| y =~ r ? "#{y.partition(r)[2]}#{y.partition(r)[1]}ay" : y}}
end
puts convert_sentence_pig_latin2('"I said this", I said.')
# define instance method: String#to_pl
class String
R = Regexp.new '^[^aeiou]+', true # => /^[^aeiou]+/i
def to_pl
self.gsub(/"([^"]*)"/m) {|x| x.gsub(/\w+/) {|y| y =~ R ? "#{y.partition(R)[2]}#{y.partition(R)[1]}ay" : y}}
end
end
puts '"I said this", I said.'.to_pl
sources:
http://www.ruby-doc.org/core-2.1.0/Regexp.html
http://ruby-doc.org/core-2.0/String.html#method-i-gsub

Ruby: Splitting string on last number character

I have the following method in my Ruby model:
Old:
def to_s
numbers = self.title.scan(/\d+/) if self.title.scan(/\d+/)
return numbers.join.insert(0, "#{self.title.chop} ") if numbers
"#{self.title.titlecase}"
end
New:
def to_s
numbers = self.title.scan(/\d+/)
return numbers.join.insert(0, "#{self.title.sub(/\d+/, '')} ") if numbers.any?
self.title.titlecase
end
A title can be like so: Level1 or TrackStar
So TrackStar should become Track Star and Level1 should become Level 1, which is why I am doing the scan for numbers to begin with.
I am trying to display it like Level 1. The above works; I was just curious to know if there was a more elegant solution.
Try this:
def to_s
self.title.split(/(?=[0-9])/, 2).join(" ")
end
The second argument to split is to make sure a title like "Level10" doesn't get transformed into "Level 1 0".
Edit - to add spaces between words as well, I'd use gsub:
def to_s
self.title.gsub(/([a-z])([A-Z])/, '\1 \2').split(/(?=\d)/, 2).join(" ")
end
Be sure to use single-quotes in the second argument to gsub.
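To see what the limit argument buys you, here is a small sketch that wraps the expression above in a standalone method (spaced_title is a made-up name, not part of the model in the question) and compares it with an unlimited split.

```ruby
# Insert a space between camel-cased words, then split once before the
# first digit so multi-digit numbers stay together.
def spaced_title(title)
  title.gsub(/([a-z])([A-Z])/, '\1 \2').split(/(?=\d)/, 2).join(' ')
end

with_limit    = spaced_title("Level10")
without_limit = "Level10".split(/(?=\d)/).join(' ')

puts with_limit                # prints "Level 10"
puts without_limit             # prints "Level 1 0"
puts spaced_title("TrackStar") # prints "Track Star"
```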
How about this:
'Level1'.split(/(\d+)/).join(' ')
#=> "Level 1"

Ruby regular expression function

I have a problem that I can't find a solution for; I suspect it needs regular expressions.
Say I have a string that a user has entered, such as 'asdasd asdaeew asioij'.
How would I accomplish this:
for every word
execute me
end
Regular expressions seem overkill for this one - just use
s = "aaaaaa bbbb cccc"
s.split.each do |w|
puts w
end
Another possibility, splitting by word boundaries instead of spaces:
kiss = "Keep it simple, stupid!"
kiss.scan(/\b\w+\b/) do |w|
puts w #=> "Keep" ... "it" ... "simple" ... "stupid"
end
# Instead of:
kiss.split.each do |w|
puts w #=> "Keep" ... "it" ... "simple," ... "stupid!"
end

What's the fastest way to check if a word from one string is in another string?

I have a string of words; let's call them bad:
bad = "foo bar baz"
I can keep this string as a whitespace separated string, or as a list:
bad = bad.split(" ");
If I have another string, like so:
str = "This is my first foo string"
What's the fastest way to check if any word from the bad string is within my comparison string, and what's the fastest way to remove said word if it's found?
#Find if a word is there
bad.split(" ").each do |word|
found = str.include?(word)
end
#Remove the word
bad.split(" ").each do |word|
str.gsub!(/#{word}/, "")
end
If the list of bad words gets huge, a hash is a lot faster:
require 'benchmark'
bad = ('aaa'..'zzz').to_a # 17576 words
str= "What's the fasted way to check if any word from the bad string is within my "
str += "comparison string, and what's the fastest way to remove said word if it's "
str += "found"
str *= 10
badex = /\b(#{bad.join('|')})\b/i
bad_hash = {}
bad.each{|w| bad_hash[w] = true}
n = 10
Benchmark.bm(10) do |x|
x.report('regex:') {n.times do
str.gsub(badex,'').squeeze(' ')
end}
x.report('hash:') {n.times do
str.gsub(/\b\w+\b/){|word| bad_hash[word] ? '': word}.squeeze(' ')
end}
end
                 user     system      total        real
regex:      10.485000   0.000000  10.485000 ( 13.312500)
hash:        0.000000   0.000000   0.000000 (  0.000000)
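The hash lookup benchmarked above could be packaged roughly like this. It is a sketch under a few assumptions: the method name is invented, a Set stands in for the hash of true values, and matching is made case-insensitive by downcasing both sides.

```ruby
require 'set'

# Remove every word of +text+ that appears in +bad_words+, ignoring case.
def remove_bad_words(text, bad_words)
  lookup = Set.new(bad_words.map(&:downcase))
  text.gsub(/\b\w+\b/) { |word| lookup.include?(word.downcase) ? '' : word }
      .squeeze(' ')
      .strip
end

cleaned = remove_bad_words("This is my first Foo string", %w[foo bar baz])
puts cleaned # prints "This is my first string"
```

squeeze(' ') also collapses any runs of spaces that were already in the input, which may or may not be what you want.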
bad = "foo bar baz"
=> "foo bar baz"
str = "This is my first foo string"
=> "This is my first foo string"
(str.split(' ') - bad.split(' ')).join(' ')
=> "This is my first string"
All the solutions have problems with catching the bad words if the case does not match. The regex solution is easiest to fix by adding the ignore-case flag:
badex = /\b(#{bad.split.join('|')})\b/i
In addition, using "String".include?(" String ") will lead to boundary problems with the first and last words in the string or strings where the target words have punctuation or are hyphenated. Testing for those situations will result in a lot of other code being needed. Because of that I think the regex solution is the best one. It is not the fastest but it is going to be more flexible right out of the box, and, if the other algorithms are tweaked to handle case folding and compound-words the regex solution might pull ahead.
#!/usr/bin/ruby
require 'benchmark'
bad = 'foo bar baz comparison'
badex = /\b(#{bad.split.join('|')})\b/i
str = "What's the fasted way to check if any word from the bad string is within my comparison string, and what's the fastest way to remove said word if it's found?" * 10
n = 10_000
Benchmark.bm(20) do |x|
x.report('regex:') do
n.times { str.gsub(badex,'').gsub('  ',' ') }
end
x.report('regex with squeeze:') do
n.times{ str.gsub(badex,'').squeeze(' ') }
end
x.report('array subtraction') do
n.times { (str.split(' ') - bad.split(' ')).join(' ') }
end
end
I made the str variable a lot longer, to make the routines work a bit harder.
                           user     system      total        real
regex:                 0.740000   0.010000   0.750000 (  0.752846)
regex with squeeze:    0.570000   0.000000   0.570000 (  0.581304)
array subtraction      1.430000   0.010000   1.440000 (  1.449578)
Doh!, I'm too used to how other languages handle their benchmarks. Now I got it working and looking better!
Just a little comment about what it looks like the OP is trying to do: Blacklisted word removal is easy to fool, and a pain to keep maintained. L33t-sp34k makes it trivial to sneak words through. Depending on the application, people will consider it a game to find ways to push offensive words past the filtering. The best solution I found when I was asked to work on this, was to create a generator that would create all the variations on a word and dump them into a database where some process could check as soon as possible, rather than in real time. A million small strings being checked can take a while if you are searching through a long list of offensive words; I'm sure we could come up with quite a list of things that someone would find offensive, but that's an exercise for a different day.
I haven't seen anything similar in Ruby to Perl's Regexp::Assemble, but that was a good way to go after this sort of problem. You can pass an array of words, plus options for case-folding and word-boundaries, and it will spit out a regex pattern that will match all the words, with their commonalities considered to result in the smallest pattern that will match all words in the list. The problem after that is locating which word in the original string matched the hits found by the pattern, so they can be removed. Differences in word case and hits within compound-words make that replacement more interesting.
And we won't even go into words that are benign or offensive depending on the context.
I added a bit more comprehensive test for the array-subtraction benchmark, to fit how it would need to work in a real piece of code. The if clause is specified in the answer, this now reflects it:
#!/usr/bin/env ruby
require 'benchmark'
bad = 'foo bar baz comparison'
badex = /\b(#{bad.split.join('|')})\b/i
str = "What's the fasted way to check if any word from the bad string is within my comparison string, and what's the fastest way to remove said word if it's found?" * 10
str_split = str.split
bad_split = bad.split
n = 10_000
Benchmark.bm(20) do |x|
x.report('regex') do
n.times { str.gsub(badex,'').gsub('  ',' ') }
end
x.report('regex with squeeze') do
n.times{ str.gsub(badex,'').squeeze(' ') }
end
x.report('bad.any?') do
n.times {
if (bad_split.any? { |bw| str.include?(bw) })
(str_split - bad_split).join(' ')
end
}
end
x.report('array subtraction') do
n.times { (str_split - bad_split).join(' ') }
end
end
with two test runs:
ruby test.rb
                           user     system      total        real
regex                  1.000000   0.010000   1.010000 (  1.001093)
regex with squeeze     0.870000   0.000000   0.870000 (  0.873224)
bad.any?               1.760000   0.000000   1.760000 (  1.762195)
array subtraction      1.350000   0.000000   1.350000 (  1.346043)
ruby test.rb
                           user     system      total        real
regex                  1.000000   0.010000   1.010000 (  1.004365)
regex with squeeze     0.870000   0.000000   0.870000 (  0.868525)
bad.any?               1.770000   0.000000   1.770000 (  1.775567)
array subtraction      1.360000   0.000000   1.360000 (  1.359100)
I usually make a point of not optimizing without measurements, but here's a wag:
To make it fast, you should iterate through each string once. You want to avoid a loop with bad count * str count inner compares. So, you could build a big regexp and gsub with it.
(adding foo variants to test word boundary works)
str = "This is my first foo fooo ofoo string"
=> "This is my first foo fooo ofoo string"
badex = /\b(#{bad.split.join('|')})\b/
=> /\b(foo|bar|baz)\b/
str.gsub(badex,'').gsub('  ',' ')
=> "This is my first fooo ofoo string"
Of course the huge resulting regexp might be as slow as the implied nested iteration in my other answer. Only way to know is to measure.
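One possible way to build that big regexp a little more defensively is Regexp.union, which escapes any regex metacharacters in the words. This is only a sketch with invented method names; it also swaps the second gsub for squeeze and adds the ignore-case flag, both of which are assumptions about what you want.

```ruby
# Build one case-insensitive, whole-word pattern from a list of words.
# Regexp.union escapes each word before joining them with "|".
def bad_word_pattern(bad_words)
  /\b(?:#{Regexp.union(bad_words).source})\b/i
end

# Remove the matching words in a single pass, then collapse doubled spaces.
def strip_bad_words(text, bad_words)
  text.gsub(bad_word_pattern(bad_words), '').squeeze(' ')
end

stripped = strip_bad_words("This is my first foo fooo ofoo string", %w[foo bar baz])
puts stripped # prints "This is my first fooo ofoo string"
```

If the word list never changes, building the pattern once and reusing it avoids recompiling the regexp on every call.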
bad = %w(foo bar baz)
str = "This is my first foo string"
# find the first word in the list
found = bad.find {|word| str.include?(word)}
# remove it
str[found] = '' if found # str => "This is my first  string" (the doubled space is left behind)
I'd benchmark this:
bad = "foo bar baz".split(' ')
str = "This is my first foo string".split(' ')
# 1. What's the fasted way to check if any word from the bad string is within my comparison string
p bad.any? { |bw| str.include?(bw) }
# 2. What's the fastest way to remove said word if it's found?
p (str - bad).join(' ')
any? stops checking as soon as it sees a match. If you can order your bad words by their probability, you can save some cycles.
Here's one that will check for words and phrases.
def checkContent(str)
  bad = ["foo", "bar", "this place sucks", "or whatever"]
  # may be best to map and singularize everything as well.
  # maybe add some regex to catch those pesky, "How i make $69 dollars each second online..."
  # maybe apply some comparison stuff to check for weird characters in those pesky, "How i m4ke $69 $ollars an hour"
  bad_hash = {}
  bad_phrase_hash = {}
  bad.map(&:downcase).each do |word|
    words = word.split().map(&:downcase)
    if words.length > 1
      words.each do |inner|
        if bad_hash.key?(inner)
          if bad_hash[inner].is_a?(Hash) && !bad_hash[inner].key?(words.length)
            bad_hash[inner][words.length] = true
          elsif bad_hash[inner] === 1
            bad_hash[inner] = {1=>true,words.length => true}
          end
        else
          bad_hash[inner] = {words.length => true}
        end
      end
      bad_phrase_hash[word] = true
    else
      bad_hash[word] = 1
    end
  end
  string = str.split().map(&:downcase)
  string.each_with_index do |word,index|
    if bad_hash.key?(word)
      if bad_hash[word].is_a?(Hash)
        if bad_hash[word].key?(1)
          return false
        else
          bad_hash[word].keys.sort.each do |length|
            value = string[index...(index + length)].join(" ")
            if bad_phrase_hash.key?(value)
              return false
            end
          end
        end
      else
        return false
      end
    end
  end
  return true
end
The include? method is what you need. The Ruby String specification says:
str.include?( string ) -> true or false
Returns true if str contains the given string or character.
"hello".include? "lo" -> true
"hello".include? "ol" -> false
"hello".include? ?h -> true
Note that it is O(n), whereas what you proposed is O(n^2).
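One caveat worth illustrating: include? looks for substrings, not whole words. The sketch below contrasts it with a word-boundary check; contains_word? is a hypothetical helper, and making it case-insensitive is an assumption.

```ruby
# True only when +word+ appears in +text+ as a whole word, ignoring case.
def contains_word?(text, word)
  text.match?(/\b#{Regexp.escape(word)}\b/i)
end

substring_hit = "I like seafood".include?("foo")             # substring match
whole_word    = contains_word?("I like seafood", "foo")      # no whole word "foo"
mixed_case    = contains_word?("My first Foo string", "foo") # case is ignored

p substring_hit # prints true
p whole_word    # prints false
p mixed_case    # prints true
```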
