I'm trying to display an array of words from a user's post. However the method I'm using treats an apostrophe like whitespace.
<%= var = Post.pluck(:body) %>
<%= var.join.downcase.split(/\W+/) %>
So if the input text was: The baby's foot
it would output the baby s foot,
but it should be the baby's foot.
How do I accomplish that?
Accepted answer is too naïve:
▶ "It’s naïve approach".split(/[^'\w]+/)
#⇒ [
# [0] "It",
# [1] "s",
# [2] "nai",
# [3] "ve",
# [4] "approach"
# ]
This is because it is almost 2016, and many users will want to use their real names, such as José Østergaard. The straight apostrophe is not the only punctuation to account for either, as you might notice.
▶ "It’s naïve approach".split(/[^'’\p{L}\p{M}]+/)
#⇒ [
# [0] "It’s",
# [1] "naïve",
# [2] "approach"
# ]
Further reading: Character Properties.
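As a rough sketch of how this could be applied to the original question, scanning for word-like runs (instead of splitting on separators) also avoids an empty first element when the text starts with punctuation or whitespace. The constant `WORD` and the helper `words_in` are made-up names for illustration, and the input is assumed to be UTF-8:

```ruby
# Hypothetical helper, not part of the original answer.
# A "word" here is a run of letters and combining marks, optionally
# joined by a straight (') or curly (\u2019) apostrophe.
WORD = /[\p{L}\p{M}]+(?:['\u2019][\p{L}\p{M}]+)*/

def words_in(text)
  # The group is non-capturing, so scan returns whole matches as strings.
  text.downcase.scan(WORD)
end

p words_in("The baby's foot")
# => ["the", "baby's", "foot"]
p words_in("It\u2019s na\u00EFve approach")
# => ["it’s", "naïve", "approach"]
```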
Along the lines of mudasobwa's answer, here's what \w and \W bring to the party:
chars = [*' ' .. "\x7e"].join
# => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
Those are the usual visible lower-ASCII characters we'd see in code. See the Regexp documentation for more information.
Grabbing the characters that match \w returns:
chars.scan(/\w+/)
# => ["0123456789",
# "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
# "_",
# "abcdefghijklmnopqrstuvwxyz"]
Conversely, grabbing the characters that don't match \w, or that match \W:
chars.scan(/\W+/)
# => [" !\"\#$%&'()*+,-./", ":;<=>?@", "[\\]^", "`", "{|}~"]
\w is defined as [a-zA-Z0-9_], which is not what you'd normally call "word" characters. Instead, these are typically the characters we use to define variable names.
If you're dealing with only lower-ASCII characters, use the character-class
[a-zA-Z]
For instance:
chars = [*' ' .. "\x7e"].join
lower_ascii_chars = '[a-zA-Z]'
not_lower_ascii_chars = '[^a-zA-Z]'
chars.scan(/#{lower_ascii_chars}+/)
# => ["ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"]
chars.scan(/#{not_lower_ascii_chars}+/)
# => [" !\"\#$%&'()*+,-./0123456789:;<=>?@", "[\\]^_`", "{|}~"]
Instead of defining your own, you can take advantage of the POSIX definitions and character properties:
chars.scan(/[[:alpha:]]+/)
# => ["ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"]
chars.scan(/\p{Alpha}+/)
# => ["ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"]
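To tie this back to the earlier point about non-ASCII names, here is a small comparison on a UTF-8 string. The constant `NAME` is just an illustrative value; the point is that the hand-written class and `\w` stop at accented letters, while the POSIX bracket and the character property do not:

```ruby
# Illustrative input only: "José Østergaard", written with escapes.
NAME = "Jos\u00E9 \u00D8stergaard"

p NAME.scan(/[a-zA-Z]+/)     # => ["Jos", "stergaard"]
p NAME.scan(/\w+/)           # => ["Jos", "stergaard"]
p NAME.scan(/[[:alpha:]]+/)  # => ["José", "Østergaard"]
p NAME.scan(/\p{Alpha}+/)    # => ["José", "Østergaard"]
```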
Regular expressions always seem like a wonderful new wand to wave when extracting information from a string, but, like the Sorcerer's Apprentice found out, they can create havoc when misused or not understood.
Knowing this should help you write a bit more intelligent patterns. Apply that to what the documentation shows and you should be able to easily figure out a pattern that does what you want.
You can use the regex below instead of /\W+/:
var.join.downcase.split(/[^'\w]+/)
/\W/ matches any non-word character, and the apostrophe is one such character.
To keep the code as close as possible to the original intent, we can use /[^'\w]/, which matches any character that is neither an apostrophe nor a word character.
Running that string through irb with the same split call that you wrote in your comment gets this:
irb(main):008:0> "The baby's foot".split(/\W+/)
=> ["The", "baby", "s", "foot"]
However, if you use split without an explicit delimiter, you get the split you're looking for:
irb(main):009:0> "The baby's foot".split
=> ["The", "baby's", "foot"]
Does that get you what you're looking for?
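One caveat worth checking before relying on a bare split: it only splits on whitespace, so any other punctuation stays attached to the words. A quick comparison, using a made-up sentence (`SENTENCE` is not from the question):

```ruby
# Illustrative input with a comma and a full stop.
SENTENCE = "The baby's foot, again."

# Whitespace-only split keeps the punctuation on the words.
p SENTENCE.split             # => ["The", "baby's", "foot,", "again."]

# Splitting on anything that is not an apostrophe or \w drops it.
p SENTENCE.split(/[^'\w]+/)  # => ["The", "baby's", "foot", "again"]
```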
I'm trying to write a regular expression in Ruby where I want to see if the string contains a certain word (e.g. "string"), followed by a url and link name in parentheses.
Right now I'm doing:
string.include?("string") && string.scan(/\(([^\)]+)\)/).present?
My input in both conditionals is a string. In the first one, I'm checking if it contains the word "link" and then I will have the link and link_name in parentheses, like this:
"Please go to link( url link_name)"
After validating that, I extract the HTML link.
Is there a way I can combine them using regular expressions?
The most important improvement you can make is to also test that the word and the parentheses have the correct relationship. If I understand correctly, "link(url link_name)" should be a match but "(url link_name)link" or "link stuff (url link_name)" should not. So match "link", the parentheses, and their contents, and capture the contents, all at once:
"stuff link(url link_name) more stuff".match(/link\((\S+?) (\S+?)\)/)&.captures
=> ["url", "link_name"]
(&. requires Ruby 2.3 or later; use Rails' .try :captures in older versions.)
Side note: string.scan(regex).present? is more concisely written as string =~ regex.
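For what it's worth, here is a small sketch that wraps the same pattern in a helper and checks it against the three cases mentioned above. `LINK_PATTERN` and `extract_link` are names invented for this example, and `m && m.captures` is used so it also runs on Rubies older than 2.3:

```ruby
# Same pattern as above, stored in a constant (name is hypothetical).
LINK_PATTERN = /link\((\S+?) (\S+?)\)/

# Returns [url, link_name] when the text contains link(url link_name),
# or nil when it does not.
def extract_link(text)
  m = text.match(LINK_PATTERN)
  m && m.captures
end

p extract_link("stuff link(url link_name) more stuff")  # => ["url", "link_name"]
p extract_link("(url link_name)link")                   # => nil
p extract_link("link stuff (url link_name)")            # => nil
```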
Checking If a Word Is Contained
If you want to find matches that contain a specific word somewhere in the string, you can accomplish this through a lookahead:
# This will match any string that contains your string "{your-string-here}"
(?=.*({your-string-here}).*).*
You could consider building a string version of your expression and passing the word you are looking for using a variable:
wordToFind = "link"
if stringToTest =~ /(?=.*(#{wordToFind}).*).*/
# stringToTest contains "link"
else
# stringToTest does not contain "link"
end
Checking for a Word AND Parentheses
If you also wanted to ensure that somewhere in your string you had a set of parentheses with some content in them and your previous lookahead for a word, you could use:
# This will match any strings that contain your word and contain a set of parentheses
(?=.*({your-string-here}).*).*\([^\)]+\).*
which might be used as:
wordToFind = "link"
if stringToTest =~ /(?=.*(#{wordToFind}).*).*\([^\)]+\).*/
# stringToTest contains "link" and some non-empty parentheses
else
# stringToTest does not contain "link" or non-empty parentheses
end
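If the word comes from a variable, it may be safer to escape it before interpolating, so that characters like `.` are matched literally. A minimal sketch of that idea follows; the method name `word_and_parens?` is made up, and the trailing `.*` pieces are dropped because they don't change whether a match exists:

```ruby
# Hypothetical helper: true when text contains the literal word somewhere
# and a non-empty pair of parentheses somewhere.
def word_and_parens?(text, word)
  # Regexp.escape keeps regex metacharacters in word from being interpreted.
  pattern = /(?=.*#{Regexp.escape(word)}).*\([^\)]+\)/
  !(text =~ pattern).nil?
end

p word_and_parens?("Please go to link( url link_name)", "link")  # => true
p word_and_parens?("Please go to (url)", "link")                 # => false
p word_and_parens?("link only", "link")                          # => false
p word_and_parens?("axb (x)", "a.b")                             # => false
```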
def has_both?(str, word)
str.scan(/\b#{word}\b|(?<=\()[^\(\)]+(?=\))/).size == 2
end
has_both?("Wait for me, Wild Bill.", "Bill")
#=> false
has_both?("Wait (for me), Wild William.", "Bill")
#=> false
has_both?("Wait (for me), Wild Billy.", "Bill")
#=> false
has_both?("Wait (for me), Wild bill.", "Bill")
#=> false
has_both?("Wait (for me, Wild Bill.", "Bill")
#=> false
has_both?("Wait (for me), Wild Bill.", "Bill")
#=> true
has_both?("Wait ((for me), Wild Bill.", "Bill")
#=> true
has_both?("Wait ((for me)), Wild Bill.", "Bill")
#=> true
These are the calculations for
word = "Bill"
str = "Wait (for me), Wild Bill."
r = /
\b#{word}\b # match the value of the variable 'word' with word breaks fore and aft
| # or
(?<=\() # match a left paren in a positive lookbehind
[^\(\)]+ # match one or more characters other than parens
(?=\)) # match a right paren in a positive lookahead
/x # free-spacing regex definition mode
#=> /
\bBill\b # match the value of the variable 'word' with word breaks fore and aft
| # or
(?<=\() # match a left paren in a positive lookbehind
[^\(\)]+ # match one or more characters other than parens
(?=\)) # match a right paren in a positive lookahead
/x
arr = str.scan(r)
#=> ["for me", "Bill"]
arr.size == 2
#=> true
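One thing to watch with counting scan results: a size of 2 can also come from two parenthesized groups and no word, or from the word appearing twice with no parentheses. If that matters for your input, a variant that tests each condition separately might look like this (the method name `has_word_and_parens?` is invented here, and `Regexp.escape` is an addition not in the code above):

```ruby
# Hypothetical stricter variant: checks the word and the parentheses
# independently instead of counting scan results.
def has_word_and_parens?(str, word)
  has_word   = !(str =~ /\b#{Regexp.escape(word)}\b/).nil?
  has_parens = !(str =~ /\([^()]+\)/).nil?
  has_word && has_parens
end

p has_word_and_parens?("Wait (for me), Wild Bill.", "Bill")  # => true
p has_word_and_parens?("Wait for me, Wild Bill.", "Bill")    # => false
p has_word_and_parens?("(a) and (b)", "Bill")                # => false
p has_word_and_parens?("Bill and Bill", "Bill")              # => false
```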
I would go with something like this regex:
/link\s*\(([^\)\s]+)\s*([^\)]+)?\)/i
This will find any match starting with the word link, followed by any number of spaces, then a url followed by a link name, both in parentheses. In this regex, the link name is optional, but the url is not. The matching is case-insensitive, so it will match link and LINK exactly the same.
You can use the Regexp#match method to compare the regex to a string, and check the result for matches and captures, like so:
m = /link\s*\(([^\)\s]+)\s*([^\)]+)?\)/i.match("link (stackoverflow.com StackOverflow)")
if m # the match array is not nil
puts "Matched: #{m[0]}"
puts " -- url: #{m[1]}"
puts " -- link-name: #{m[2] || 'none'}"
else # the match array is nil, so no match was found
puts "No match found"
end
If you'd like to use different strings to identify the match, you can use a non-capturing group, where you change link to something like:
(?:link|site|website|url)
In this case, the (?: syntax says not to capture this part of the match. If you want to capture which term matched, simply change that from (?: to (, and adjust the capture indexes by 1 to account for the new capture value.
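Here's a quick sketch of both variants, to show how the capture indexes shift. The constant names and the sample string are made up for illustration:

```ruby
# Non-capturing alternation: captures are still url (1) and link name (2).
ANY_LINK = /(?:link|site|website|url)\s*\(([^\)\s]+)\s*([^\)]+)?\)/i

# Capturing alternation: the matched term becomes capture 1,
# so url and link name move to 2 and 3.
ANY_LINK_WITH_TERM = /(link|site|website|url)\s*\(([^\)\s]+)\s*([^\)]+)?\)/i

SAMPLE = "Website (example.com Example)"

m = ANY_LINK.match(SAMPLE)
p [m[1], m[2]]        # => ["example.com", "Example"]

m = ANY_LINK_WITH_TERM.match(SAMPLE)
p [m[1], m[2], m[3]]  # => ["Website", "example.com", "Example"]
```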
Here's a short Ruby test program:
data = [
[ true, "link (http://google.com Google)", "http://google.com", "Google" ],
[ true, "LiNk(ftp://website.org)", "ftp://website.org", nil ],
[ true, "link (https://facebook.com/realstanlee/ Stan Lee) linkety link", "https://facebook.com/realstanlee/", "Stan Lee" ],
[ true, "x link (https://mail.yahoo.com Yahoo! Mail)", "https://mail.yahoo.com", "Yahoo! Mail" ],
[ false, "link lunk (http://www.com)", nil, nil ]
]
data.each do |test_case|
link = /link\s*\(([^\)\s]+)\s*([^\)]+)?\)/i.match(test_case[1])
url = link ? link[1] : nil
link_name = link ? link[2] : nil
success = test_case[0] == !link.nil? && test_case[2] == url && test_case[3] == link_name
puts "#{success ? 'Pass' : 'Fail'}: '#{test_case[1]}' #{link ? 'found' : 'not found'}"
if success && link
puts " -- url: '#{url}' link_name: '#{link_name || '(no link name)'}'"
end
end
This produces the following output:
Pass: 'link (http://google.com Google)' found
-- url: 'http://google.com' link_name: 'Google'
Pass: 'LiNk(ftp://website.org)' found
-- url: 'ftp://website.org' link_name: '(no link name)'
Pass: 'link (https://facebook.com/realstanlee/ Stan Lee) linkety link' found
-- url: 'https://facebook.com/realstanlee/' link_name: 'Stan Lee'
Pass: 'x link (https://mail.yahoo.com Yahoo! Mail)' found
-- url: 'https://mail.yahoo.com' link_name: 'Yahoo! Mail'
Pass: 'link lunk (http://www.com)' not found
If you want to allow anything other than spaces between the word 'link' and the first paren, simply change the \s* to [^\(]* and you should be good to go.
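To illustrate that last change, here is the failing test case from above run against both versions of the pattern. The constant names `STRICT` and `LOOSE` are just labels for this sketch:

```ruby
# Original pattern: only whitespace allowed between "link" and "(".
STRICT = /link\s*\(([^\)\s]+)\s*([^\)]+)?\)/i
# Relaxed pattern: anything except "(" allowed between "link" and "(".
LOOSE  = /link[^\(]*\(([^\)\s]+)\s*([^\)]+)?\)/i

TEXT = "link lunk (http://www.com)"

p STRICT.match(TEXT)    # => nil
p LOOSE.match(TEXT)[1]  # => "http://www.com"
p LOOSE.match(TEXT)[2]  # => nil
```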
Two questions that I believe are connected and I think are regex related but have me stumped after some fruitless googling.
validates :image_url, format: { with: %r{\.(gif|jpg)\Z}i }
My guesses: similar to ruby/regex i = ignore case, single pipe means 'or'. Guessing \Z means end of string. The brackets are just containers unlike ruby/regex where they signify something wildly different.
But what does the %r do? I haven't run across that in ruby/regex.
ok_urls = %w{ fred.gif fred.jpg FRED.Jpg}
%r and %w seem to be doing the same thing so I'm confused why there are two separate commands to do the same thing. Sorry if this isn't very clear.
A Regexp holds a regular expression, used to match a pattern against strings. Regexps are created using the /.../ and %r{...} literals, and by the Regexp::new constructor.
%r and %w seem to be doing the same thing so I'm confused..
%w{ fred.gif fred.jpg FRED.Jpg}
# => ["fred.gif", "fred.jpg", "FRED.Jpg"]
%r{ a b }
# => / a b /
No, they are not the same, as you can see above.
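To make the difference concrete with the code from the question, here is a small sketch: %r builds the pattern, %w builds the list of strings to test it against. The constant names and the `BAD_URLS` values are illustrative and not from the question:

```ruby
IMAGE_URL = %r{\.(gif|jpg)\Z}i                 # a Regexp
OK_URLS   = %w{ fred.gif fred.jpg FRED.Jpg }   # an Array of Strings
BAD_URLS  = %w{ fred.doc fred.gif/more fred.gif.more }  # made-up counterexamples

p IMAGE_URL.class                            # => Regexp
p OK_URLS.class                              # => Array
p OK_URLS.all?   { |url| url =~ IMAGE_URL }  # => true
p BAD_URLS.none? { |url| url =~ IMAGE_URL }  # => true
```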
One thing I noticed with %r{} is that you don't need to escape slashes.
# /../ literals:
url.match /http:\/\/example\.com\//
# => #<MatchData "http://example.com/">
# %r{} literals:
url.match %r{http://example\.com/}
# => #<MatchData "http://example.com/">
Use %r only for regular expressions matching more than one '/' character.
# bad
%r(\s+)
# still bad
%r(^/(.*)$)
# should be /^\/(.*)$/
# good
%r(^/blog/2011/(.*)$)
I have:
str ="this is the string "
and I have an array of strings:
array =["this is" ,"second element", "third element"]
I want to process the string so that any substring matching an element of the array is removed and the rest of the string is returned. I want the following output:
output: "the string "
How can I do this?
You don't say whether you want true substring matching, or substring matching at word-boundaries. There's a difference. Here's how to do it honoring word boundaries:
str = "this is the string "
array = ["this is" ,"second element", "third element"]
pattern = /\b(?:#{ Regexp.union(array).source })\b/ # => /\b(?:this\ is|second\ element|third\ element)\b/
str[pattern] # => "this is"
str.gsub(pattern, '').squeeze(' ').strip # => "the string"
Here's what's happening with union and union.source:
Regexp.union(array) # => /this\ is|second\ element|third\ element/
Regexp.union(array).source # => "this\\ is|second\\ element|third\\ element"
source returns the joined array in a form that can be more easily consumed by Regex when creating a pattern, without it injecting holes into the pattern. Consider these differences and what they could do in a pattern match:
/#{ Regexp.union(%w[a . b]) }/ # => /(?-mix:a|\.|b)/
/#{ Regexp.union(%w[a . b]).source }/ # => /a|\.|b/
The first creates a separate pattern, with its own flags for case, multiple-line and white-space honoring, that would be embedded inside the outer pattern. That can be a bug that's really hard to track down and fix, so only do it when you intend to have the sub-pattern.
Also, notice what happens if you try to use:
/#{ %w[a . b].join('|') }/ # => /a|.|b/
The resulting pattern has a wildcard . embedded in it, which would blow apart your pattern, causing it to match anything. Don't go there.
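A quick way to see the difference is to test both patterns against a character that isn't in the list at all. The constant names are just labels for this sketch:

```ruby
UNSAFE = /#{ %w[a . b].join('|') }/  # => /a|.|b/   (the dot is a wildcard)
SAFE   = Regexp.union(%w[a . b])     # => /a|\.|b/  (the dot is escaped)

p "x" =~ UNSAFE  # => 0
p "x" =~ SAFE    # => nil
p "." =~ SAFE    # => 0
```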
If we don't tell the regex engine to honor word boundaries then unexpected/undesirable/terrible things can happen:
str = "this isn't the string "
array = ["this is" ,"second element", "third element"]
pattern = /(?:#{ Regexp.union(array).source })/ # => /(?:this\ is|second\ element|third\ element)/
str[pattern] # => "this is"
str.gsub(pattern, '').squeeze(' ').strip # => "n't the string"
It's important to think in terms of words, when working with substrings containing complete words. The engine doesn't know the difference, so you have to tell it what to do. This is a situation missed all too often by people who haven't had to do text processing.
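Pulling the pieces above together, a small helper could look like the following. This is only a sketch; `remove_phrases` is a made-up name, and it assumes every phrase starts and ends with a word character so that \b behaves as expected:

```ruby
# Hypothetical helper: removes whole-word occurrences of any phrase,
# then tidies up the leftover whitespace.
def remove_phrases(text, phrases)
  pattern = /\b(?:#{Regexp.union(phrases).source})\b/
  text.gsub(pattern, '').squeeze(' ').strip
end

PHRASES = ["this is", "second element", "third element"]

p remove_phrases("this is the string ", PHRASES)     # => "the string"
p remove_phrases("this isn't the string ", PHRASES)  # => "this isn't the string"
```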
Here is one way -
array =["this is" ,"second element", "third element"]
str = "this is the string "
str.gsub(Regexp.union(array),'') # => " the string "
To make the match case-insensitive: str.gsub(/#{array.join('|')}/i,'')
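If the array might contain regex metacharacters, one possible alternative is to keep the escaping that Regexp.union provides and only add the case-insensitive flag. A sketch, with constant names invented for the example:

```ruby
PHRASES = ["this is", "second element", "third element"]

# Reuse the escaped source from Regexp.union and add the /i option.
CASE_INSENSITIVE = Regexp.new(Regexp.union(PHRASES).source, Regexp::IGNORECASE)

p "THIS IS the string ".gsub(CASE_INSENSITIVE, '')  # => " the string "
```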
I saw two kinds of solutions, and at first I preferred Brad's. But the two approaches are so different that I expected a performance difference, so I created the file below and ran it.
require 'benchmark/ips'
str = 'this is the string '
array =['this is' ,'second element', 'third element']
def by_loop(str, array)
array.inject(str) { |result , substring| result.gsub substring, '' }
end
def by_regex(str, array)
str.gsub(Regexp.union(array),'')
end
def by_loop_large(str, array)
array = array * 100
by_loop(str, array)
end
def by_regex_large(str, array)
array = array * 100
by_regex(str, array)
end
Benchmark.ips do |x|
x.report("loop") { by_loop(str, array) }
x.report("regex") { by_regex(str, array) }
x.report("loop large") { by_loop_large(str, array) }
x.report("regex large") { by_regex_large(str, array) }
end
The result:
-------------------------------------------------
loop 16719.0 (±10.4%) i/s - 83888 in 5.073791s
regex 18701.5 (±4.2%) i/s - 94554 in 5.063600s
loop large 182.6 (±0.5%) i/s - 918 in 5.027865s
regex large 330.9 (±0.6%) i/s - 1680 in 5.076771s
The conclusion:
Arup's approach is much more efficient when the array grows large.
As for the Tin Man's concern about a single quote in the text, I think it's very important, but that would be the responsibility of the OP rather than of these algorithms. And the two approaches produce the same result on that string.
I am saving a price string to my database in a decimal-type column.
The price comes in like this "$ 123.99" which is fine because I wrote a bit of code to remove the "$ ".
However, I forgot that the price may include a comma, so "$ 1,234.99" breaks my code. How can I also remove the comma?
This is my code to remove dollar sign and space:
def price=(price_str)
write_attribute(:price, price_str.sub("$ ", ""))
# possible code to remove comma also?
end
You can get there two ways easily.
String's delete method is good for removing all occurrences of the target strings:
'$ 1.23'.delete('$ ,') # => "1.23"
'$ 123,456.00'.delete('$ ,') # => "123456.00"
Or, use String's tr method:
'$ 1.23'.tr('$, ', '') # => "1.23"
'$ 123,456.00'.tr('$ ,', '') # => "123456.00"
tr takes a string of characters to search for, and a string of characters used to replace them. Consider it a chain of gsub methods, one for each character.
BUT WAIT! THERE'S MORE! If the replacement string is empty, all characters in the search string will be removed.
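Applied to the setter in the question, the cleanup step might look like the sketch below. `clean_price` is a made-up name, shown as a plain method so it runs outside Rails; inside the model you would pass its result to write_attribute. The last line shows tr with a non-empty replacement for comparison:

```ruby
# Hypothetical helper: strips dollar signs, spaces and commas.
def clean_price(price_str)
  price_str.delete('$ ,')
end

p clean_price("$ 123.99")    # => "123.99"
p clean_price("$ 1,234.99")  # => "1234.99"

# tr maps each character in the first string to the one at the same
# position in the second string.
p "hello".tr('el', 'ip')     # => "hippo"
```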
Hey... how would you validate a full_name field (name surname)?
Consider names like:
Ms. Jan Levinson-Gould
Dr. Martin Luther King, Jr.
Brett d'Arras-d'Haudracey
Brüno
Instead of validating the characters that are there, you might just want to ensure some set of characters is not present.
For example:
class User < ActiveRecord::Base
validates_format_of :full_name, :with => /\A[^0-9`!@#\$%\^&*+_=]+\z/
# add any other characters you'd like to disallow inside the [ brackets ]
# metacharacters [, \, ^, $, ., |, ?, *, +, (, and ) need to be escaped with a \
end
Tests
Ms. Jan Levinson-Gould # pass
Dr. Martin Luther King, Jr. # pass
Brett d'Arras-d'Haudracey # pass
Brüno # pass
John Doe # pass
Mary-Jo Jane Sally Smith # pass
Fatty Mc.Error$ # fail
FA!L # fail
@arold Newm@n # fail
N4m3 w1th Numb3r5 # fail
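If you want to try the pattern outside of a Rails model, a plain-Ruby check along these lines should reproduce the table above. It assumes the disallowed set includes both `@` and `#`; `FULL_NAME` and `valid_name?` are names invented for the sketch:

```ruby
# The pattern from the validation, assuming @ and # are both disallowed.
FULL_NAME = /\A[^0-9`!@#\$%\^&*+_=]+\z/

# Hypothetical stand-in for the ActiveRecord validation.
def valid_name?(name)
  !(name =~ FULL_NAME).nil?
end

p valid_name?("Ms. Jan Levinson-Gould")     # => true
p valid_name?("Brett d'Arras-d'Haudracey")  # => true
p valid_name?("Br\u00FCno")                 # => true
p valid_name?("Fatty Mc.Error$")            # => false
p valid_name?("N4m3 w1th Numb3r5")          # => false
```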
Regular expression explanation
NODE EXPLANATION
--------------------------------------------------------------------------------
\A the beginning of the string
--------------------------------------------------------------------------------
[^`!@#\$%\^&*+_=\d]+ any character except: '`', '!', '@', '#',
'\$', '%', '\^', '&', '*', '+', '_', '=',
digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
\z the end of the string
Any validation you perform here is likely to break down unless it is extremely general. For instance, enforcing a minimum length of 3 is probably about as reasonable as you can get without getting into the specifics of what is entered.
When you have names like "O'Malley" with an apostrophe, "Smith-Johnson" with a dash, "Andrés" with accented characters or extremely short names such as "Vo Ly" with virtually no characters at all, how do you validate without excluding legitimate cases? It's not easy.
At least one space and at least 4 char (including the space)
\A(?=.* )[^0-9`!@#\\\$%\^&*\;+_=]{4,}\z
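A short sketch of how that pattern behaves on a few inputs, again assuming `@` and `#` are both in the disallowed set. The constant and method names are made up for illustration:

```ruby
# Lookahead (?=.* ) requires at least one space; {4,} requires at least
# four allowed characters in total (the space counts).
NAME_WITH_SPACE = /\A(?=.* )[^0-9`!@#\\\$%\^&*;+_=]{4,}\z/

def full_name?(name)
  !(name =~ NAME_WITH_SPACE).nil?
end

p full_name?("Vo Ly")       # => true
p full_name?("John Doe")    # => true
p full_name?("Br\u00FCno")  # => false (no space)
p full_name?("A B")         # => false (only 3 characters)
p full_name?("J0hn Doe")    # => false (contains a digit)
```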