How can I search for a word using Ruby? - ruby-on-rails

I have a name of a show like oferson of interest.
In my code I am trying to split it into single words, capitalize the first letter of each word, then join them back together with a space between each word, which gives: Oferson Of Interest. I then want to search for the word Of and replace it with the lower-case version.
The problem I can't figure out is that at the end of the program I get oferson of Interest, which isn't what I want. I only wanted the word "of" to be lower case, not the first letter of the word "Oferson". Simply put, I wanted an output of Oferson of Interest, not oferson of Interest.
How can I search for the single word 'of' not for every instance of the letters 'o' and 'f' in the sentence?
mine = 'oferson of interest'.split(' ').map {|w| w.capitalize }.join(' ')
if mine.include? "Of"
mine.gsub!(/Of/, 'of')
else
puts 'noting;'
end
puts mine

The simplest answer is to use word boundaries in your regular expression:
str = "oferson of interest".split.collect(&:capitalize).join(" ")
str.gsub!(/\bOf\b/i, 'of')
# => Oferson of Interest
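To make the difference visible, here is a small sketch that runs the question's pattern and the word-boundary pattern side by side. The variable names are made up for illustration.

```ruby
# Build the capitalized title the same way the question does.
title = 'oferson of interest'.split.map(&:capitalize).join(' ')

# /Of/ matches the letters "Of" anywhere, including inside "Oferson".
without_boundary = title.gsub(/Of/, 'of')

# /\bOf\b/ only matches "Of" when it stands alone as a whole word.
with_boundary = title.gsub(/\bOf\b/, 'of')

puts without_boundary # oferson of Interest
puts with_boundary    # Oferson of Interest
```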

You're dealing with "stop words": Words you don't want to process for some reason. Build a list of stopwords you want to ignore, and compare each word to them to see whether you want to do further processing to it:
require 'set'
STOPWORDS = %w[a for is of the to].to_set
TEXT = [
'A stitch in time saves nine',
'The quick brown fox jumped over the lazy dog',
'Now is the time for all good men to come to the aid of their country'
]
TEXT.each do |text|
puts text.split.map{ |w|
STOPWORDS.include?(w.downcase) ? w.downcase : w.capitalize
}.join(' ')
end
# >> a Stitch In Time Saves Nine
# >> the Quick Brown Fox Jumped Over the Lazy Dog
# >> Now is the Time for All Good Men to Come to the Aid of Their Country
That's a simple example, but shows the basics. In real life you'll want to handle punctuation, like hyphenated words.
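One possible way to cope with hyphenated words, as a rough sketch only: instead of splitting on whitespace, run gsub over each run of letters so that hyphens and other punctuation stay where they are. The names SMALL_WORDS and title_case are invented for this illustration, and the regex is used only to find the words, not to decide their case.

```ruby
require 'set'

# Hypothetical stop-word list, same words as in the example above.
SMALL_WORDS = %w[a for is of the to].to_set

# Capitalize every run of letters unless it is a stop word.
# Hyphens, spaces and other punctuation are left untouched.
def title_case(text, stopwords = SMALL_WORDS)
  text.gsub(/[[:alpha:]']+/) do |word|
    lower = word.downcase
    stopwords.include?(lower) ? lower : lower.capitalize
  end
end

puts title_case('the state-of-the-art show')
# the State-of-the-Art Show
```

Like the split-based version, this leaves a leading stop word in lower case.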
I used a Set because lookups stay fast as the list of stop words grows. It's akin to a Hash, so the check is faster than using include? on an Array:
require 'set'
require 'fruity'
LETTER_ARRAY = ('a' .. 'z').to_a
LETTER_SET = LETTER_ARRAY.to_set
compare do
array {LETTER_ARRAY.include?('0') }
set { LETTER_SET.include?('0') }
end
# >> Running each test 16384 times. Test will take about 2 seconds.
# >> set is faster than array by 10x ± 0.1
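If you'd rather not install the fruity gem, a similar comparison can be sketched with the Benchmark module from the standard library. The list size and iteration count below are arbitrary assumptions, and the timings will differ from machine to machine, so no output is shown.

```ruby
require 'set'
require 'benchmark'

# A made-up word list, large enough for the difference to show up.
WORDS_ARRAY = (1..2_000).map { |i| "word#{i}" }
WORDS_SET = WORDS_ARRAY.to_set

# Worst case for the Array: the word is missing, so every element is compared.
array_time = Benchmark.realtime do
  500.times { WORDS_ARRAY.include?('missing') }
end

# The Set hashes the word and checks a single bucket.
set_time = Benchmark.realtime do
  500.times { WORDS_SET.include?('missing') }
end

puts format('array: %.5fs, set: %.5fs', array_time, set_time)
```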
It gets more interesting when you want to protect the first letter of the resulting string, but the simple trick is to force just that letter back to uppercase if it matters:
require 'set'
STOPWORDS = %w[a for is of the to].to_set
TEXT = [
'A stitch in time saves nine',
'The quick brown fox jumped over the lazy dog',
'Now is the time for all good men to come to the aid of their country'
]
TEXT.each do |text|
str = text.split.map{ |w|
STOPWORDS.include?(w.downcase) ? w.downcase : w.capitalize
}.join(' ')
str[0] = str[0].upcase
puts str
end
# >> A Stitch In Time Saves Nine
# >> The Quick Brown Fox Jumped Over the Lazy Dog
# >> Now is the Time for All Good Men to Come to the Aid of Their Country
This isn't a good task for a regular expression, unless you're dealing with very consistent text patterns. Since you're working on the names of TV shows, odds are good you're not going to find much consistency and your pattern would grow in complexity quickly.

Related

Looping through array targeting upcase letters only

I am trying to loop through a string, which I have converted to an array, and target only the upper-case letters so that I can insert a space before each one. My code finds the first capital letter and adds the space, but I am struggling to do the same for the next capital letter, which in this case is "T". Any advice would be appreciated. Thanks
def break_camel(str)
# ([A-Z])/.match(str)
saved_string = str.freeze
cap_index =str.index(/[A-Z]/)
puts(cap_index)
x =str.split('').insert(cap_index, " ")
x.join
end
break_camel("camelCasingTest")
It's much easier to operate on your string directly, using String#gsub, than breaking it into pieces, operating on each piece then gluing everything back together again.
def break_camel(str)
str.gsub(/(?=[A-Z])/, ' ')
end
break_camel("camelCasingTest")
#=> "camel Casing Test"
break_camel("CamelCasingTest")
#=> " Camel Casing Test"
This converts a "zero-width position", immediately before each capital letter (and after the preceding character, if there is one), to a space. The expression (?=[A-Z]) is called a positive lookahead.
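As a small illustration of what "zero-width" means here, the sketch below records where the lookahead matches. Each match is an empty string at a position, so no characters are consumed. The variable names are only for this example.

```ruby
str = 'camelCasingTest'

# Collect the offset of every position that is followed by a capital letter.
positions = []
str.scan(/(?=[A-Z])/) { positions << Regexp.last_match.begin(0) }

p positions                      # [5, 11]
p str.gsub(/(?=[A-Z])/, ' ')     # "camel Casing Test"
```

The result is longer than the input by exactly the number of matched positions, because gsub only inserts and never removes anything.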
If you don't want to insert a space if the capital letter is at the beginning of a line, change the method as follows.
def break_camel(str)
str.gsub(/(?<=.)(?=[A-Z])/, ' ')
end
break_camel("CamelCasingTest")
#=> "Camel Casing Test"
(?<=.) is a positive lookbehind that requires the capital letter to be preceded by any character for the match to be made.
Another way of writing this is as follows.
def break_camel(str)
str.gsub(/(?<=.)([A-Z])/, ' \1')
end
break_camel("CamelCasingTest")
#=> "Camel Casing Test"
Here the regular expression matches a capital letter that is not at the beginning of the line and saves it to capture group 1. It is then replaced by a space followed by the contents of capture group 1.
I think your approach is trying to keep reapplying your method for as long as it is needed. One extension of your code is to use recursion:
def break_camel(str)
regex = /[a-z][A-Z]/
if str.match(regex)
cap_index = str.index(regex)
str.insert(cap_index + 1, " ")
break_camel(str)
else
str
end
end
break_camel("camelCasingTest") #=> "camel Casing Test"
Notice the call to break_camel inside the method itself. Another way is to use the scan method with an appropriate regex and then join the pieces.
In code:
'camelCasingTest'.scan(/[A-Z]?[a-z]+/).join(' ') #=> "camel Casing Test"
Do you have to implement your own?
Looks like titleize https://apidock.com/rails/ActiveSupport/Inflector/titleize has this covered.

Rails array INCLUDE with only distinct words

I'm building a profanity search function which needs to find instances of an array of profane words in a long string of text.
One could do a simple include like:
if profane_words.any? {|word| self.name.downcase.include? word}
...
end
This results in a positive match if ANY of the array of profane words are present anywhere in the text.
However, if a word like 'hell' is considered profane, this would produce a positive match against "Hell's Angels" or "Hell's Kitchen", which is undesirable.
How can the above search be modified to only produce positive results against distinct words or phrases? For example, "Hell Angels" returns positive but "Hell's Angels" returns negative.
To be clear, this means we're searching for any instance of a profane word that is not immediately preceded or followed by another character or apostrophe.
What about using a regex ?
profane_words.any? { |word| self.name.downcase.match? /#{word}(?!')/ }
Examples:
"hell's angels".match?(/hell(?!')/) # => false
"hell angel".match?(/hell(?!')/) # => true
(?!') is a negative lookahead, meaning the pattern won't match if the word has a ' right after it. If you'd like to exclude other characters, you can add them with pipes, e.g. (?!'|") excludes both ' and ".
See https://www.regular-expressions.info/lookaround.html for reference.
And you could make it more performant like this:
self.name.downcase.match? /(?:#{profane_words.join('|')})(?!')/
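Putting these pieces together, here is one possible sketch that also adds word boundaries, so a word like "shell" does not count as a hit. PROFANE_WORDS, PROFANE_REGEX and profane? are names invented for this example, and the word list is a placeholder. Regexp.union escapes any special characters in the words.

```ruby
# Placeholder list; substitute your own words.
PROFANE_WORDS = %w[hell damn].freeze

# \b ... \b keeps the match to whole words, the non-capturing group makes the
# lookahead apply to every alternative, and (?!') rejects a trailing apostrophe.
PROFANE_REGEX = /\b(?:#{Regexp.union(PROFANE_WORDS).source})\b(?!')/i

def profane?(text)
  PROFANE_REGEX.match?(text)
end

p profane?('Hell Angels')    # true
p profane?("Hell's Angels")  # false
p profane?('shell company')  # false
```

The /i flag replaces the downcase call, so the original text does not need to be copied.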
if profane_words.any? {|word| self.name.downcase.split(' ').include? word} ... end
You should definitely use a regex containing all your profane words followed by a space or period. For example:
> "Hell's angels".match(/(hell|shit)[ .]/i)
=> nil
> "Hell angels".match(/(hell|shit)[ .]/i)
=> #<MatchData "Hell " 1:"Hell">
> "Hell's angels shit".match(/(hell|shit)[ .]/i)
=> nil

Regular expression in Ruby - extracting from Gutenberg

I am fairly new to Ruby and I am struggling with a regular expression to seed a database from this text file: http://www.gutenberg.org/cache/epub/673/pg673.txt.
I want the <h1> tags as the words for the dictionary database, and the <def> tags as the definitions.
I could be quite off base here (I've only ever seeded a db with copy and paste ;):
require 'open-uri'
Dictionary.delete_all
g_text = open('http://www.gutenberg.org/cache/epub/673/pg673.txt')
y = g_text.read(/<h1>(.*?)<\/h1>/)
a = g_text.read(/<def>(.*?)<\/def>/)
Dictionary.create!(:word => y, :definition => a)
As you can see, there are often more than one <def> for each <h1>, which is fine, as I can just add columns to my table for definition1, definition2, etc.
But what would this regular expression look like to be sure that each definition is in the same row as the immediately preceding <h1> tag?
Thanks for any help!
Edit:
Okay, so this is what i am trying now:
doc.scan(Regexp.union(/<h1>(.*?)<\/h1>/, /<def>(.*?)<\/def>/)).map do |m, n|
p [m,n]
end
How do I get rid of all of the nil entries?
It seems like regular expression is the only way of making it through the whole document without stopping part way through when an error is encountered...at least after a couple attempts at other parsers.
Here is what I came up with (using a local extract for sandbox use):
require 'pp' # For SO to pretty print the hash at end
h1regex="h1>(.+)<\/h1" # Define the h1 regex (avoid empty tags)
defregex="def>(.+)<\/def" # define the def regex (avoid empty tags)
# Initialize vars
defhash={}
key=nil
last=nil
open("./gut.txt") do |f|
f.each_line do |l|
newkey=l[/#{h1regex}/i,1] # get the next key (or nothing)
if (newkey != last && newkey != nil) then # if we changed key, update the hash (some redundant h1 entries with other defs)
key = last = newkey # update current key
defhash[key] = [] # init the new entry to empty array
end
if l[/#{defregex}/i] then
defhash[key] << l[/#{defregex}/i,1] # we did match a def, add it to the current key array
end
end
end
pp defhash # print the result
Which gives this output:
{"A"=>
[" The first letter of the English and of many other alphabets. The capital A of the alphabets of Middle and Western Europe, as also the small letter (a), besides the forms in Italic, black letter, etc., are all descended from the old Latin A, which was borrowed from the Greek <spn>Alpha</spn>, of the same form; and this was made from the first letter (<i>Aleph</i>, and itself from the Egyptian origin. The <i>Aleph</i> was a consonant letter, with a guttural breath sound that was not an element of Greek articulation; and the Greeks took it to represent their vowel <i>Alpha</i> with the \\'84 sound, the Ph\\'d2nician alphabet having no vowel symbols.",
"The name of the sixth tone in the model major scale (that in C), or the first tone of the minor scale, which is named after it the scale in A minor. The second string of the violin is tuned to the A in the treble staff. -- A sharp (A#) is the name of a musical tone intermediate between A and B. -- A flat (A&flat;) is the name of a tone intermediate between A and G.",
"In each; to or for each; <as>as, \"twenty leagues <ex>a</ex> day\", \"a hundred pounds <ex>a</ex> year\", \"a dollar <ex>a</ex> yard\", etc.</as>",
"In; on; at; by.",
"In process of; in the act of; into; to; -- used with verbal substantives in <i>-ing</i> which begin with a consonant. This is a shortened form of the preposition <i>an</i> (which was used before the vowel sound); as in <i>a</i> hunting, <i>a</i> building, <i>a</i> begging. \"Jacob, when he was <i>a</i> dying\" <i>Heb. xi. 21</i>. \"We'll <i>a</i> birding together.\" \" It was <i>a</i> doing.\" <i>Shak.</i> \"He burst out <i>a</i> laughing.\" <i>Macaulay</i>. The hyphen may be used to connect <i>a</i> with the verbal substantive (as, <i>a</i>-hunting, <i>a</i>-building) or the words may be written separately. This form of expression is now for the most part obsolete, the <i>a</i> being omitted and the verbal substantive treated as a participle.",
"Of.",
" A barbarous corruption of <i>have</i>, of <i>he</i>, and sometimes of <i>it</i> and of <i>they</i>."],
"Abalone"=>
["A univalve mollusk of the genus <spn>Haliotis</spn>. The shell is lined with mother-of-pearl, and used for ornamental purposes; the sea-ear. Several large species are found on the coast of California, clinging closely to the rocks."],
"Aband"=>["To abandon.", "To banish; to expel."],
"Abandon"=>
["To cast or drive out; to banish; to expel; to reject.",
"To give up absolutely; to forsake entirely ; to renounce utterly; to relinquish all connection with or concern on; to desert, as a person to whom one owes allegiance or fidelity; to quit; to surrender.",
"Reflexively : To give (one's self) up without attempt at self-control ; to yield (one's self) unrestrainedly ; -- often in a bad sense.",
"To relinquish all claim to; -- used when an insured person gives up to underwriters all claim to the property covered by a policy, which may remain after loss or damage by a peril insured against."]}
Hope it can help.
Late edit: there's probably a better way, as I'm not a Ruby expert. I was only going to give some general advice while reviewing, but since no one else has answered, this is how I would do it.
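For the scan call in the question's edit, the nil entries appear because each match fills only one of the two capture groups. A sketch of one way to use that, rather than filter it out, is to treat a filled first group as "new word" and a filled second group as "definition for the current word". The sample text below is a made-up, shortened extract, and the /m flag is an assumption that a definition may span several lines.

```ruby
# Shortened, made-up sample in the same tag format as the source file.
sample = <<~TEXT
  <h1>Aband</h1>
  <def>To abandon.</def>
  <def>To banish; to expel.</def>
  <h1>Abalone</h1>
  <def>A univalve mollusk.</def>
TEXT

entries = {}
current = nil

# Each match sets exactly one of the two groups; the other one is nil.
sample.scan(%r{<h1>(.*?)</h1>|<def>(.*?)</def>}m) do |word, definition|
  if word
    current = word
    entries[current] ||= []          # repeated <h1> entries keep their list
  elsif current
    entries[current] << definition   # attach to the most recent <h1>
  end
end

p entries
# {"Aband"=>["To abandon.", "To banish; to expel."], "Abalone"=>["A univalve mollusk."]}
```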

Fastest way to search and replace in a string in Ruby?

I'm building a library that cleans up user generated content and have thousands of string replacements to make (performance is key).
What's the fastest way to do search and replacements in strings?
Here's an example of the replacements the library will make:
u2 => you too
2day => today
2moro => tomorrow
2morrow => tomorrow
2tomorow => tomorrow
There are four cases on how the string can appear:
Starting word in the string (has a space at the end, but not in front of it) 2day sample
Middle of the string (has a space in front and at the end of it) sample 2day sample
End of the string (only has a space in front, but is the last word) sample 2day
The entire string is a match 2day
i.e. The regex shouldn't replace it if it's in the middle of a word like sample2daysample
A possible solution:
replaces = {'u2' => 'you too', '2day' => 'today', '2moro' => 'tomorrow'}
str = '2day and 2moro are u2 sample2daysample'
# exp = Regexp.union(replaces.keys) # simplest option, but it cannot wrap each key in \b, so the pattern is built by hand below
exp = Regexp.new(replaces.keys.map { |x| "\\b" + Regexp.escape(x) + "\\b" }.join('|'))
str = str.gsub(exp, replaces)
# => "today and tomorrow are you too sample2daysample"
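The four cases in the question are all defined by spaces or the ends of the string, whereas \b also treats punctuation as a boundary. If only whitespace should count, one possible variation is to use lookarounds for "no non-space character on either side". This is a sketch, and REPLACEMENTS, PATTERN and expand_slang are names invented here.

```ruby
REPLACEMENTS = {
  'u2' => 'you too',
  '2day' => 'today',
  '2moro' => 'tomorrow',
  '2morrow' => 'tomorrow',
  '2tomorow' => 'tomorrow'
}.freeze

# (?<!\S) and (?!\S) succeed at a space or at either end of the string,
# which covers all four cases listed in the question.
PATTERN = /(?<!\S)(?:#{Regexp.union(REPLACEMENTS.keys).source})(?!\S)/

def expand_slang(text)
  # With a Hash argument, gsub looks up each matched string as a key.
  text.gsub(PATTERN, REPLACEMENTS)
end

puts expand_slang('2day and 2moro are u2 sample2daysample')
# today and tomorrow are you too sample2daysample
```

The pattern is compiled once and reused, which matters when the method is called many times.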
Full Disclosure: I am the author of this gem
If you don't need regex you can try https://github.com/jedld/multi_string_replace this uses the aho-corasick algorithm to achieve this.
                         user     system      total        real
multi gsub           1.322510   0.000000   1.322510 (  1.344405)
MultiStringReplace   0.196823   0.007979   0.204802 (  0.207219)
mreplace             0.200593   0.004031   0.204624 (  0.205379)
The only issue I see is that the algorithm does not understand word boundaries so you have to decompose your use case to:
"2day ", " 2day ", " 2day"

Print first 100 words of a text before ellipsing

If I have some text that I want to print out on a page, but only want to print, say, the first 100 words before truncating it with an ellipsis... what's the easiest way to do this?
How's this for a start:
def first_words(s, n)
a = s.split(/\s/) # or /[ ]+/ to only split on spaces
a[0...n].join(' ') + (a.size > n ? '...' : '')
end
s = "The quick brown fox jumps over the lazy dog. " * 20
puts "#{s.size}, #{s.split(/\s/).size}"
#-> 900, 180
puts first_words(s, 10)
#-> The quick brown fox jumps over the lazy dog. The...
puts first_words("a b c d", 10)
#-> a b c d
You have a couple of options. One way is to assume that a word is n characters long, take a substring of 100 × n characters, append the ellipsis and display it. Or you could run through the string and count the spaces: if you assume there is only one space between words, the 100th space comes right after the 100th word, so cut there, append the ellipsis and you are done.
Which one has better performance would likely depend upon how the functions are written, most likely the substring operation is going to be faster than counting the spaces. However, the performance difference might be negligible so unless you are doing this a lot, counting spaces would likely be the most accurate way to go.
Also, just as a reference, the average length of a word in the English language is 5.1 characters.
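The substring option could be sketched as below. To avoid cutting a word in half, this version backs up to the last space inside the limit, which is an addition of mine rather than part of the description above. The method name truncate_chars is made up for this example.

```ruby
# Cut text to at most `limit` characters, then back up to the last space
# so that the final word is not chopped in the middle.
def truncate_chars(text, limit)
  return text if text.size <= limit

  cut = text[0, limit]
  last_space = cut.rindex(' ')
  cut = cut[0, last_space] if last_space
  cut + '...'
end

puts truncate_chars('The quick brown fox', 12)  # The quick...
puts truncate_chars('short', 12)                # short
```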
# Note: this truncates by characters, not by words
snippet = text.slice(0, 100)
snippet += "..." if text.size > 100
puts snippet
http://www.ruby-doc.org/core/classes/String.html
