Fastest way to search and replace in a string in Ruby? - ruby-on-rails

I'm building a library that cleans up user generated content and have thousands of string replacements to make (performance is key).
What's the fastest way to do search and replacements in strings?
Here's an example of the replacements the library will make:
u2 => you too
2day => today
2moro => tomorrow
2morrow => tomorrow
2tomorow => tomorrow
There are four cases for where a match can appear:
Starting word in the string (has a space at the end, but not in front of it): 2day sample
Middle of the string (has a space in front and at the end of it): sample 2day sample
End of the string (only has a space in front, but is the last word): sample 2day
The entire string is a match: 2day
i.e. The regex shouldn't replace it if it's in the middle of a word like sample2daysample

A possible solution:
replaces = {'u2' => 'you too', '2day' => 'today', '2moro' => 'tomorrow'}
str = '2day and 2moro are u2 sample2daysample'
# exp = Regexp.union(replaces.keys) # simplest option, but wrapping each key in \b requires building the pattern by hand
exp = Regexp.new(replaces.keys.map { |x| "\\b" + Regexp.escape(x) + "\\b" }.join('|'))
str = str.gsub(exp, replaces)
# => "today and tomorrow are you too sample2daysample"
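If the goal is to match only space-delimited tokens, as in the four cases listed in the question, whitespace lookarounds can stand in for \b. The following is a minimal sketch rather than a benchmarked implementation; the names REPLACEMENTS, PATTERN and clean are illustrative and do not come from the original answer. The pattern is compiled once and reused for every call.

```ruby
# Illustrative names; the replacement table is taken from the question.
REPLACEMENTS = {
  'u2'       => 'you too',
  '2day'     => 'today',
  '2moro'    => 'tomorrow',
  '2morrow'  => 'tomorrow',
  '2tomorow' => 'tomorrow'
}.freeze

# Regexp.union escapes each key and joins the keys with "|".
# (?<!\S) and (?!\S) require whitespace or a string edge on both sides.
PATTERN = /(?<!\S)(?:#{Regexp.union(REPLACEMENTS.keys).source})(?!\S)/

def clean(text)
  # gsub with a Hash looks up each matched key and substitutes its value.
  text.gsub(PATTERN, REPLACEMENTS)
end

puts clean('2day and 2moro are u2 sample2daysample')
# prints: today and tomorrow are you too sample2daysample
```

Unlike the \b version, this leaves a token such as "u2," untouched, because the comma is a non-whitespace character. Whether that is desirable depends on how punctuation should be treated.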

Full Disclosure: I am the author of this gem
If you don't need regex, you can try https://github.com/jedld/multi_string_replace, which uses the Aho-Corasick algorithm to do the replacements.
                         user     system      total        real
multi gsub           1.322510   0.000000   1.322510 (  1.344405)
MultiStringReplace   0.196823   0.007979   0.204802 (  0.207219)
mreplace             0.200593   0.004031   0.204624 (  0.205379)
The only issue I see is that the algorithm does not understand word boundaries so you have to decompose your use case to:
"2day ", " 2day ", " 2day"
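Another way to sidestep the word-boundary limitation without any regex matching of the keys is to split on spaces and look each token up in a Hash. This is a plain-Ruby sketch, not the gem's API; SLANG and replace_tokens are hypothetical names, and no performance claim is made for it.

```ruby
# Hypothetical helper, not part of multi_string_replace.
SLANG = { 'u2' => 'you too', '2day' => 'today', '2moro' => 'tomorrow' }.freeze

def replace_tokens(text, table = SLANG)
  # Splitting on a single space with limit -1 keeps empty fields, so runs of
  # spaces and leading/trailing spaces survive the round trip through join.
  text.split(/ /, -1).map { |token| table.fetch(token, token) }.join(' ')
end

puts replace_tokens('2day and 2moro are u2 sample2daysample')
# prints: today and tomorrow are you too sample2daysample
```

Because only whole tokens are looked up, all four cases from the question are covered, and the cost per token is a single Hash lookup regardless of how many replacements are defined.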

Related

Rails array INCLUDE with only distinct words

I'm building a profanity search function which needs to find instances of an array of profane words in a long string of text.
One could do a simple include like:
if profane_words.any? {|word| self.name.downcase.include? word}
...
end
This results in a positive match if ANY of the array of profane words are present anywhere in the text.
However, if a word like 'hell' is considered profane, this would produce a positive match against "Hell's Angels" or "Hell's Kitchen", which is undesirable.
How can the above search be modified to only produce positive results against distinct words or phrases? For example, "Hell Angels" returns positive but "Hell's Angels" returns negative.
To be clear, this means we're searching for any instance of a profane word that is immediately preceded or followed by another character or apostrophe.
What about using a regex ?
profane_words.any? { |word| self.name.downcase.match? /#{word}(?!')/ }
Examples:
"hell's angels".match?(/hell(?!')/) # => false
"hell angel".match?(/hell(?!')/) # => true
(?!') is a negative lookahead, meaning the pattern won't match if the word has a ' right after it. If you'd like to exclude other characters, you can add them to the list with pipes, e.g. (?!'|") rejects both ' and ".
See https://www.regular-expressions.info/lookaround.html for reference.
And you could make it more performant like this:
self.name.downcase.match?(/(?:#{profane_words.join('|')})(?!')/)
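A combined pattern can also use \b so that a word embedded in a longer word is not flagged. The sketch below is one possible arrangement, assuming a placeholder word list; PROFANE_WORDS, PROFANE_PATTERN and profane? are illustrative names.

```ruby
# Placeholder word list for illustration only.
PROFANE_WORDS = %w[hell damn].freeze

# (?:...) groups the alternatives so that \b and (?!') apply to every word.
# The /i flag removes the need to downcase the input first.
PROFANE_PATTERN = /\b(?:#{Regexp.union(PROFANE_WORDS).source})\b(?!')/i

def profane?(text)
  text.match?(PROFANE_PATTERN)
end

p profane?("Hell Angels")    # => true
p profane?("Hell's Angels")  # => false
p profane?("Shell station")  # => false
```

Regexp.union escapes any regex metacharacters in the words, which a plain join('|') does not.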
if profane_words.any? {|word| self.name.downcase.split(' ').include? word} ... end
You should definitely use a regex containing all your profane words followed by a space or period. Below yo
> "Hell's angels".match(/(hell|shit)[ .]/i)
=> nil
> "Hell angels".match(/(hell|shit)[ .]/i)
=> #<MatchData "Hell " 1:"Hell">
> "Hell's angels shit".match(/(hell|shit)[ .]/i)
=> nil

How can I search for a word using Ruby?

I have a name of a show like oferson of interest.
In my code I split it into single words, capitalize the first letter of each word, and join the words back together with a space between them, which gives: Oferson Of Interest. I then want to search for the word Of and replace it with the lowercase version.
The problem is that at the end of the program I get oferson of Interest, which isn't what I want. I only wanted the word "of" to be lowercase, not the first two letters of "Oferson". In other words, I want Oferson of Interest, not oferson of Interest.
How can I search for the single word 'of' not for every instance of the letters 'o' and 'f' in the sentence?
mine = 'oferson of interest'.split(' ').map {|w| w.capitalize }.join(' ')
if mine.include? "Of"
mine.gsub!(/Of/, 'of')
else
puts 'noting;'
end
puts mine
The simplest answer is to use word boundaries in your regular expression:
str = "oferson of interest".split.collect(&:capitalize).join(" ")
str.gsub!(/\bOf\b/i, 'of')
# => Oferson of Interest
You're dealing with "stop words": Words you don't want to process for some reason. Build a list of stopwords you want to ignore, and compare each word to them to see whether you want to do further processing to it:
require 'set'
STOPWORDS = %w[a for is of the to].to_set
TEXT = [
'A stitch in time saves nine',
'The quick brown fox jumped over the lazy dog',
'Now is the time for all good men to come to the aid of their country'
]
TEXT.each do |text|
puts text.split.map{ |w|
STOPWORDS.include?(w.downcase) ? w.downcase : w.capitalize
}.join(' ')
end
# >> a Stitch In Time Saves Nine
# >> the Quick Brown Fox Jumped Over the Lazy Dog
# >> Now is the Time for All Good Men to Come to the Aid of Their Country
That's a simple example, but shows the basics. In real life you'll want to handle punctuation, like hyphenated words.
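As a rough illustration of the hyphenated-word case, each hyphen-separated part can be treated as its own word. This is a sketch under that assumption only; SMALL_WORDS and title_case are hypothetical names, and the stop-word list is the same one used in the example above.

```ruby
require 'set'

# Illustrative names; same stop-word list as the example above.
SMALL_WORDS = %w[a for is of the to].to_set

def title_case(text)
  text.split.map { |word|
    # Treat each hyphen-separated part as its own word; -1 keeps a trailing
    # empty part so a dangling hyphen is not lost.
    word.split('-', -1).map { |part|
      SMALL_WORDS.include?(part.downcase) ? part.downcase : part.capitalize
    }.join('-')
  }.join(' ')
end

puts title_case('state-of-the-art person of interest')
# prints: State-of-the-Art Person of Interest
```

Other punctuation (quotes, trailing commas, apostrophes) would need similar handling and is not covered here.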
I used a Set because lookups stay fast as the list of stop words grows; it's akin to a Hash, so the check is faster than using include? on an Array:
require 'set'
require 'fruity'
LETTER_ARRAY = ('a' .. 'z').to_a
LETTER_SET = LETTER_ARRAY.to_set
compare do
array {LETTER_ARRAY.include?('0') }
set { LETTER_SET.include?('0') }
end
# >> Running each test 16384 times. Test will take about 2 seconds.
# >> set is faster than array by 10x ± 0.1
It gets more interesting when you want to protect the first letter of the resulting string, but the simple trick is to force just that letter back to uppercase if it matters:
require 'set'
STOPWORDS = %w[a for is of the to].to_set
TEXT = [
'A stitch in time saves nine',
'The quick brown fox jumped over the lazy dog',
'Now is the time for all good men to come to the aid of their country'
]
TEXT.each do |text|
str = text.split.map{ |w|
STOPWORDS.include?(w.downcase) ? w.downcase : w.capitalize
}.join(' ')
str[0] = str[0].upcase
puts str
end
# >> A Stitch In Time Saves Nine
# >> The Quick Brown Fox Jumped Over the Lazy Dog
# >> Now is the Time for All Good Men to Come to the Aid of Their Country
This isn't a good task for a regular expression, unless you're dealing with very consistent text patterns. Since you're working on the names of TV shows, odds are good you're not going to find much consistency and your pattern would grow in complexity quickly.

How to split string into 2 parts after certain position

For example I have some random string:
str = "26723462345"
And I want to split it into 2 parts after the 6th character. How do I do this correctly?
Thank you!
This should do it
[str[0..5], str[6..-1]]
or
[str.slice(0..5), str.slice(6..-1)]
Really should check out http://corelib.rubyonrails.org/classes/String.html
Here's one option. Be aware, however, that it will mutate your original string:
part1, part2 = str.slice!(0...6), str
p part1 # => "267234"
p part2 # => "62345"
p str # => "62345"
Update
In the years since I wrote this answer I’ve come to agree with the commenters complaining that it might be excessively clever. Below are a few other options that don’t mutate the original string.
Caveat: This one will only work with ASCII characters.
str.unpack("a6a*")
# => ["267234", "62345"]
The next one uses the magic variable $', which returns the part of the string after the most recent Regexp match:
part1, part2 = str[/.{6}/], $'
p [part1, part2]
# => ["267234", "62345"]
And this last one uses a lookbehind to split the string in the right place without returning any extra parts:
p str.split(/(?<=^.{6})/)
# => ["267234", "62345"]
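The best way IMO is string.scan(/.{6}/), provided the length is a multiple of the chunk size: scan returns only full chunks, so a shorter remainder is dropped (use /.{1,6}/ to keep it).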
irb(main)> str
=> "abcdefghijklmnopqrstuvwxyz"
irb(main)> str.scan(/.{13}/)
=> ["abcdefghijklm", "nopqrstuvwxyz"]
_, part1, part2 = str.partition /.{6}/
https://ruby-doc.org/core-1.9.3/String.html#method-i-partition
As a fun answer, how about:
str.split(/(^.{1,6})/)[1..-1]
This works because split returns the capture group matches, in addition to the parts of the string before and after the regular expression.
Here's a reusable version for you:
str = "26723462345"
n = str.length
boundary = 6
head = str.slice(0, boundary) # => "267234"
tail = str.slice(boundary, n) # => "62345"
It also preserves the original string, which may come in handy later in the program.
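The same idea can be wrapped in a small method. This is a minimal sketch; split_at is a hypothetical helper name, and it assumes a non-negative index.

```ruby
# Hypothetical helper: split str into [head, tail] after `index` characters.
# Assumes index >= 0. The original string is not modified.
def split_at(str, index)
  head = str[0...index]
  # str[index..-1] is nil when index is past the end, so fall back to "".
  tail = str[index..-1] || ''
  [head, tail]
end

p split_at("26723462345", 6)  # => ["267234", "62345"]
p split_at("abc", 10)         # => ["abc", ""]
```

String indexing in Ruby counts characters rather than bytes, so this also behaves sensibly on multibyte text.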

Ruby: Extracting Words From String

I'm trying to parse words out of a string and put them into an array. I've tried the following thing:
@string1 = "oriented design, decomposition, encapsulation, and testing. Uses "
puts @string1.scan(/\s([^\,\.\s]*)/)
It seems to do the trick, but it's a bit shaky (I should include more special characters for example). Is there a better way to do so in ruby?
Optional: I have a cs course description. I intend to extract all the words out of it and place them in a string array, remove the most common word in the English language from the array produced, and then use the rest of the words as tags that users can use to search for cs courses.
The split command.
words = @string1.split(/\W+/)
will split the string into an array based on a regular expression. \W means any "non-word" character and the "+" means to combine multiple delimiters.
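For the tagging use case mentioned in the question, scan can collect the words directly, which avoids the empty leading element that split(/\W+/) returns when the string starts with a non-word character. The following is a sketch under assumptions: COMMON_WORDS is a made-up stop list and extract_tags is a hypothetical helper name.

```ruby
# Hypothetical stop list; a real one would be much longer.
COMMON_WORDS = %w[a and in of the to uses].freeze

def extract_tags(text)
  # [[:alpha:]] is Unicode-aware; the optional group keeps apostrophes
  # that sit inside a word, as in "can't".
  words = text.downcase.scan(/[[:alpha:]]+(?:'[[:alpha:]]+)*/)
  words.uniq - COMMON_WORDS
end

p extract_tags("oriented design, decomposition, encapsulation, and testing. Uses ")
# => ["oriented", "design", "decomposition", "encapsulation", "testing"]
```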
For me, the best way to split sentences is:
line.split(/[^[[:word:]]]+/)
It works perfectly even with multilingual words and punctuation marks:
line = 'English words, Polski Żurek!!! crème fraîche...'
line.split(/[^[[:word:]]]+/)
=> ["English", "words", "Polski", "Żurek", "crème", "fraîche"]
Well, you could split the string on spaces if that's your delimiter of interest
@string1.split(' ')
Or split on non-word characters or word boundaries
\W # Any non-word character
\b # Any word boundary (a zero-width position, not a character)
Or on whitespace
\s # Any whitespace character
Hint: try testing each of these on http://rubular.com
And note that ruby 1.9 has some differences from 1.8
For Rails you can use something like this:
@string1.split(/\s/).delete_if(&:blank?)
I would write something like this:
@string
.split(/,+|\s+/) # any ',' or any whitespace characters(space, tab, newline)
.reject(&:empty?)
.map { |w| w.gsub(/\W+$|^\W+^*/, '') } # \W+$ => any trailing punctuation; ^\W+^* => any leading punctuation
irb(main):047:0> @string1 = "oriented design, 'with', !!qwe, and testing. can't rubyisgood#)(*#%)(*, and,rails,is,good"
=> "oriented design, 'with', !!qwe, and testing. can't rubyisgood#)(*#%)(*, and,rails,is,good"
irb(main):048:0> @string1.split(/,+|\s+/).reject(&:empty?).map { |w| w.gsub(/\W+$|^\W+^*/, '')}
=> ["oriented", "design", "with", "qwe", "and", "testing", "can't", "rubyisgood", "and", "rails", "is", "good"]

Best way to count words in a string in Ruby?

Is there anything better than string.scan(/(\w|-)+/).size (the - is so, e.g., "one-way street" counts as 2 words instead of 3)?
string.split.size
Edited to explain multiple spaces
From the Ruby String Documentation page
split(pattern=$;, [limit]) → anArray
Divides str into substrings based on a delimiter, returning an array
of these substrings.
If pattern is a String, then its contents are used as the delimiter
when splitting str. If pattern is a single space, str is split on
whitespace, with leading whitespace and runs of contiguous whitespace
characters ignored.
If pattern is a Regexp, str is divided where the pattern matches.
Whenever the pattern matches a zero-length string, str is split into
individual characters. If pattern contains groups, the respective
matches will be returned in the array as well.
If pattern is omitted, the value of $; is used. If $; is nil (which is
the default), str is split on whitespace as if ' ' were specified.
If the limit parameter is omitted, trailing null fields are
suppressed. If limit is a positive number, at most that number of
fields will be returned (if limit is 1, the entire string is returned
as the only entry in an array). If negative, there is no limit to the
number of fields returned, and trailing null fields are not
suppressed.
" now's the time".split #=> ["now's", "the", "time"]
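The effect of the limit parameter described in the documentation excerpt can be seen in a short sketch (the sample string is made up for illustration):

```ruby
csv = 'a,b,c,,'

p csv.split(',')      # => ["a", "b", "c"]             trailing empty fields dropped
p csv.split(',', 2)   # => ["a", "b,c,,"]              at most two fields
p csv.split(',', -1)  # => ["a", "b", "c", "", ""]     no limit, empty fields kept
p ' now  is '.split   # => ["now", "is"]               leading and repeated whitespace ignored
```

For word counting, the last form is the relevant one: with no arguments, runs of whitespace never produce empty entries.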
While that is the current version of ruby as of this edit, I learned on 1.7 (IIRC), where that also worked. I just tested it on 1.8.3.
I know this is an old question, but this might be useful to someone else looking for something more sophisticated than string.split. I wrote the words_counted gem to solve this particular problem, since defining words is pretty tricky.
The gem lets you define your own custom criteria, or use the out of the box regexp, which is pretty handy for most use cases. You can pre-filter words with a variety of options, including a string, lambda, array, or another regexp.
counter = WordsCounted::Counter.new("Hello, Renée! 123")
counter.word_count #=> 2
counter.words #=> ["Hello", "Renée"]
# filter the word "hello"
counter = WordsCounted::Counter.new("Hello, Renée!", reject: "Hello")
counter.word_count #=> 1
counter.words #=> ["Renée"]
# Count numbers only
counter = WordsCounted::Counter.new("Hello, Renée! 123", rexexp: /[0-9]/)
counter.word_count #=> 1
counter.words #=> ["123"]
The gem provides a bunch more useful methods.
If the 'word' in this case can be described as an alphanumeric sequence which can include '-' then the following solution may be appropriate (assuming that everything that doesn't match the 'word' pattern is a separator):
>> 'one-way street'.split(/[^-a-zA-Z]/).size
=> 2
>> 'one-way street'.split(/[^-a-zA-Z]/).each { |m| puts m }
one-way
street
=> ["one-way", "street"]
However, there are some other symbols that can be included in the regex - for example, ' to support the words like "it's".
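One way to cover both hyphens and apostrophes is to describe what a word is and count the matches with scan, instead of describing the separators. This is a sketch under one possible definition of "word"; WORD and count_words are hypothetical names, and digits are deliberately not counted.

```ruby
# Hypothetical definition: a "word" is a run of letters, optionally joined by
# single hyphens or apostrophes (so "one-way" and "it's" each count once).
WORD = /[[:alpha:]]+(?:['-][[:alpha:]]+)*/

def count_words(text)
  text.scan(WORD).size
end

p count_words('one-way street')         # => 2
p count_words("it's a one-way street")  # => 4
p count_words('Hello, Renée! 123')      # => 2
```

Because scan only returns matches, repeated spaces or punctuation between words cannot produce empty entries.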
This is pretty simplistic but does the job if you are typing words with spaces in between. It ends up counting numbers as well but I'm sure you could edit the code to not count numbers.
puts "enter a sentence to find its word length: "
word = gets
word = word.chomp
splits = word.split(" ")
target = splits.length.to_s
puts "your sentence is " + target + " words long"
The best way to do this is to use the split method.
split divides a string into substrings based on a delimiter, returning an array of the substrings.
split takes two parameters: pattern and limit.
pattern is the delimiter on which the string is split into an array.
limit specifies the maximum number of elements in the resulting array.
For more details, refer to Ruby Documentation: Ruby String documentation
str = "This is a string"
str.split(' ').size
#output: 4
The above code splits the string wherever it finds whitespace, so the size of the resulting array is the number of words in the string.
The split(/[^-a-zA-Z]/) solution above is wrong when words are separated by more than one space. Consider the following (note the two spaces):
"one-way  street"
You will get
["one-way","", "street"]
Use
'one-way street'.gsub(/[^-a-zA-Z]/, ' ').split.size
This splits words only on ASCII whitespace chars:
p " some word\nother\tword|word".strip.split(/\s+/).size #=> 4
