Why can't regular expressions match for # sign? - ruby-on-rails

For the string Be there # six.
Why does this work:
str.gsub! /\bsix\b/i, "seven"
But trying to replace the # sign doesn't match:
str.gsub! /\b#\b/i, "at"
Escaping it doesn't seem to work either:
str.gsub! /\b\#\b/i, "at"

This is down to how \b is interpreted. \b is a "word boundary": a zero-length match that occurs only at a position with a word character on one side and a non-word character (or the start or end of the string) on the other. The word characters are essentially [A-Za-z0-9_], and # is not one of them. In "Be there # six." the # sits between two spaces, so no word character touches it and \b cannot match on either side of it. The space itself is not the boundary.
More about word boundaries...
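One way to see where the boundaries fall is to mark every \b position in the string. This is only an illustrative sketch; the | marker and the variable names are not part of the original question.

```ruby
str = "Be there # six."

# Insert a visible marker at every word boundary position.
marked = str.gsub(/\b/, "|")
puts marked                      # |Be| |there| # |six|.

# No marker touches the #, so /\b#\b/ has nothing to anchor to.
hash_is_bounded = str.match?(/\b#\b/)
puts hash_is_bounded             # false
```

Every marker sits next to a letter; the # has none on either side.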
If you need to replace the # together with its surrounding whitespace, you can capture the whitespace between the two \b anchors and restore it with backreferences. Here each (\s*) captures zero or more whitespace characters on one side of the #.
str.gsub! /\b(\s*)#(\s*)\b/i, "\\1at\\2"
=> "Be there at six."
Or to insist upon whitespace, use \s+ instead of \s*.
str = "Be there # six."
str.gsub! /\b(\s+)#(\s+)\b/i, "\\1at\\2"
=> "Be there at six."
# No match without whitespace...
str = "Be there#six."
str.gsub! /\b(\s+)#(\s+)\b/i, "\\1at\\2"
=> nil
At this point, we're starting to introduce redundancies by forcing the use of \b. It could just as easily be done with /(\w+\s+)#(\s+\w+)/, forgoing the \b match in favor of \w word characters followed by \s whitespace.
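A short sketch of that \b-free pattern, using the same two sample strings as above (the variable names are illustrative):

```ruby
pattern = /(\w+\s+)#(\s+\w+)/

# Whitespace on both sides of the #: the captures put the words and spaces back.
with_spaces = "Be there # six.".gsub(pattern, '\1at\2')
puts with_spaces      # Be there at six.

# No whitespace around the #: \s+ cannot match, so nothing changes.
without_spaces = "Be there#six.".gsub(pattern, '\1at\2')
puts without_spaces   # Be there#six.
```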
Update after comments:
If you want to treat # like a "word" which may appear at the beginning or end, or inside bounded by whitespace, you may use \W to match "non-word" characters, combined with ^$ anchors with an "or" pipe |:
# Replace # at the start, middle, before punctuation
str = "# Be there # six #."
str.gsub! /(^|\W+)#(\W+|$)/, '\\1at\\2'
=> "at Be there at six at."
(^|\W+) matches either ^ the start of the string, or a sequence of non-word characters (like whitespace or punctuation). (\W+|$) is similar but can match the end of the string $.
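As an alternative sketch (a different technique from the capture groups above, not the answer's own method), lookarounds can assert that no word character touches the #, which avoids the backreferences. The constant name is hypothetical.

```ruby
# Assumption: "#" counts as standalone when no word character is adjacent to it.
STANDALONE_HASH = /(?<!\w)#(?!\w)/

replaced = "# Be there # six #.".gsub(STANDALONE_HASH, "at")
puts replaced    # at Be there at six at.

# A # glued to word characters is left alone.
untouched = "Be there#six.".gsub(STANDALONE_HASH, "at")
puts untouched   # Be there#six.
```

Because the lookarounds consume nothing, the surrounding whitespace and punctuation stay in place without needing \1 and \2.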

\b matches a word boundary, which is where a word character is next to a non-word character. In your string the # has a space on each side, and neither # nor a space is a word character, so there is no match.
Compare:
'be there # six'.gsub /\b#\b/, 'at'
produces
'be there # six'
(i.e. no changes)
but
'be there#six'.gsub /\b#\b/, 'at' # no spaces around #
produces
"be thereatsix"
Also
'be there # six'.gsub /#/, 'at' # no word boundaries in regex
produces
"be there at six"

Related

Rails string split every other "."

I have a bunch of sentences that I want to break into an array. Right now, I'm splitting every time \n appears in the string.
@chapters = @script.split("\n")
What I'd like to do is .split every OTHER "." in the string. Is that possible in Ruby?
You could do it with a regex, but I'd start with a simple approach: just split on periods, then join pairs of substrings:
s = "foo. bar foo. foo bar. boo far baz. bizzle"
s.split(".").each_slice(2).map {|p| p.join "." }
# => ["foo. bar foo", " foo bar. boo far baz", " bizzle"]
This is a case where it's easier to use String#scan than String#split.
We can use the following regular expression:
r = /(?<=\.|\A)[^.]*\.[^.]*(?=\.|\z)/
str=<<~_
Now is the time. This is it. It is now. The time to have fun.
The time to make new friends. The time to party.
_
str.scan(r)
#=> [
# "Now is the time. This is it",
# " It is now. The time to have fun",
# "\nThe time to make new friends. The time to party"
# ]
We can write the regular expression in free-spacing mode to make it self-documenting.
r = /
  (?<=      # begin a positive lookbehind
    \A      # match the beginning of the string
    |       # or
    \.      # match a period
  )         # end positive lookbehind
  [^.]*     # match zero or more characters other than periods
  \.        # match a period
  [^.]*     # match zero or more characters other than periods
  (?=       # begin a positive lookahead
    \.      # match a period
    |       # or
    \z      # match the end of the string
  )         # end positive lookahead
/x          # invoke free-spacing regex definition mode
Note that (?<=\.|\A) can be replaced with (?<![^\.]). (?<![^\.]) is a negative lookbehind that asserts the match is not preceded by a character other than a period.
Similarly, (?=\.|\z) can be replaced with (?![^.]). (?![^.]) is a negative lookahead that asserts the match is not followed by a character other than a period.
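A quick sanity check that the two spellings behave the same on a small sample (the sample string and variable names are made up for illustration):

```ruby
r_positive = /(?<=\.|\A)[^.]*\.[^.]*(?=\.|\z)/
r_negative = /(?<![^.])[^.]*\.[^.]*(?![^.])/

sample = "a. b. c. d. e"

pairs_positive = sample.scan(r_positive)
pairs_negative = sample.scan(r_negative)

p pairs_positive                    # ["a. b", " c. d"]
p pairs_positive == pairs_negative  # true
```

Note that the trailing " e" is dropped by both patterns because it contains no period, whereas the split/each_slice approach keeps a leftover piece.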

How do I replace all the apostrophes that come right before or right after a comma?

I have a string aString = "old_tag1,old_tag2,'new_tag1','new_tag2'"
I want to replace the apostrophes that come right before or right after a comma. For example, in my case the apostrophes enclosing new_tag1 and new_tag2 should be removed.
This is what I have right now
aString = aString.gsub("'", "")
This is problematic, however, because it also removes apostrophes inside a tag, for example if I had 'my_tag's' instead of 'new_tag1'. How do I get rid of only the apostrophes that come before or after the commas?
My desired output is
aString = "old_tag1,old_tag2,new_tag1,new_tag2"
My guess is to use a regex as well, but in a slightly different way:
aString = "old_tag1,old_tag2,'new_tag1','new_tag2','new_tag3','new_tag4's'"
aString.gsub /(?<=^|,)'(.*?)'(?=,|$)/, '\1'
#=> "old_tag1,old_tag2,new_tag1,new_tag2,new_tag3,new_tag4's"
The idea is to find a substring with bounding apostrophes and paste it back without them.
regex = /
  (?<=^|,)  # look behind for the start of the line or a comma
  '         # find an opening apostrophe
  (.*?)     # capture everything between the apostrophes, non-greedily
  '         # find a closing apostrophe
  (?=,|$)   # look ahead for a comma or the end of the line
/x
The replacement simply pastes back the content of the capture group (the text between the apostrophes). The regex has only one group, so \1 is all the replacement needs.
Thanks to @Cary for the /x modifier for regexes, I didn't know about it! Extremely useful for explanation.
This answers the question, "I want to replace the apostrophes that come right before or right after a comma".
r = /
  (?<=,)  # match a comma in a positive lookbehind
  \'      # match an apostrophe
  |       # or
  \'      # match an apostrophe
  (?=,)   # match a comma in a positive lookahead
/x        # free-spacing regex definition mode
aString = "old_tag1,x'old_tag2'x,x'old_tag3','new_tag1','new_tag2'"
aString.gsub(r, '')
#=> "old_tag1,x'old_tag2'x,x'old_tag3,new_tag1,new_tag2'"
If the objective is instead to remove single quotes enclosing a substring when the left quote is at the beginning of the string or is immediately preceded by a comma, and the right quote is at the end of the string or is immediately followed by a comma, several approaches are possible. One is to use a single, modified regex, as @Dimitry has done. Another is to split the string on commas, process each string in the resulting array and then join the modified substrings, separated by commas.
r = /
  \A  # match beginning of string
  \'  # match single quote
  .*  # match zero or more characters
  \'  # match single quote
  \z  # match end of string
/x    # free-spacing regex definition mode
aString.split(',').map { |s| (s =~ r) ? s[1..-2] : s }.join(',')
#=> "old_tag1,x'old_tag2'x,x'old_tag3',new_tag1,new_tag2"
Note:
arr = aString.split(',')
#=> ["old_tag1", "x'old_tag2'x", "x'old_tag3'", "'new_tag1'", "'new_tag2'"]
"old_tag1" =~ r #=> nil
"x'old_tag2'x" =~ r #=> nil
"x'old_tag3'" =~ r #=> nil
"'new_tag1'" =~ r #=> 0
"'new_tag2'" =~ r #=> 0
Non regex replacement
Regular expressions can get really ugly. There is a simple way to do it with just string replacement: search for the patterns ,' and ', and replace each with a plain comma.
aString.gsub(",'", ",").gsub("',", ",")
=> "old_tag1,old_tag2,new_tag1,new_tag2'"
This leaves the trailing ', but that is easy to remove with .chomp("'"). A leading ' can be removed with a simple regex .gsub(/^'/, "")
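Putting those steps together, a minimal sketch might look like this. The helper name strip_comma_quotes is hypothetical, and \A is used instead of ^ so that only the very start of the string is affected.

```ruby
# Hypothetical helper: removes apostrophes adjacent to commas, plus one
# leading and one trailing apostrophe, using plain string replacement
# wherever possible.
def strip_comma_quotes(tags)
  tags
    .gsub(",'", ",")   # apostrophe right after a comma
    .gsub("',", ",")   # apostrophe right before a comma
    .chomp("'")        # trailing apostrophe at the end of the string
    .sub(/\A'/, "")    # leading apostrophe at the start of the string
end

puts strip_comma_quotes("old_tag1,old_tag2,'new_tag1','new_tag2'")
# old_tag1,old_tag2,new_tag1,new_tag2

puts strip_comma_quotes("'a','b's','c'")
# a,b's,c
```

One caveat of this sketch: chomp removes a final apostrophe even when it is not a closing quote, so it assumes tags never legitimately end in one.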

How can I construct a regular expression to account for non-consecutive characters?

I'm currently using this regex for my names \A^[a-zA-Z'.,\s-]*\z; however, I don't want to allow two consecutive characters from the set of apostrophe, period, comma, whitespace, and hyphen. How can I do this?
The significant part would be (?:[a-zA-Z]|['.,\s-](?!['.,\s-])).
Meaning:
(?:
  [a-zA-Z]        # letters
  |               # or
  ['.,\s-]        # any of these
  (?!['.,\s-])    # but not followed by another of these
)
But, in this case:
Guedes, Washington
------^^----------
This would invalidate the name, so maybe you want to remove \s from the negative look-ahead.
Hope it helps.
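To make that concrete, here is a sketch of both variants wrapped in \A...\z with a * quantifier. The anchoring and the constant names STRICT and RELAXED are assumptions; the answer only gives the inner group.

```ruby
# Assumed full patterns built around the group shown above.
STRICT  = /\A(?:[a-zA-Z]|['.,\s-](?!['.,\s-]))*\z/
# Same, but with \s removed from the negative look-ahead.
RELAXED = /\A(?:[a-zA-Z]|['.,\s-](?!['.,-]))*\z/

puts STRICT.match?("O'Neil")               # true
puts STRICT.match?("Anne--Marie")          # false
puts STRICT.match?("Guedes, Washington")   # false (comma followed by a space)
puts RELAXED.match?("Guedes, Washington")  # true
puts RELAXED.match?("Anne--Marie")         # false
```

The relaxed variant also lets two whitespace characters appear in a row, so a separate check may still be needed if that matters.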
How about this (string of letters, potentially ending with one of those terminator chars)
\A^[a-zA-Z]*['.,\s-]?\z

Regex to extract number between brackets that are after a hashtag in ruby

I have strings in the format:
'I had a great time with @[2468] and @[1357]! #[1111] #[2321]#[1212]'
I want to be able to extract the numbers that follow the @ and # symbols, but I do not want the included square brackets. For example I would like to return:
user_ids = [2468, 1357]
hash_tag_ids = [1111, 2321, 1212]
Any ideas?
Because you want to match all occurrences of the pattern, the string.scan method is what you want. Scan automatically returns everything that matches the pattern, so you don't need to use "capture groups" (the parentheses you see in most regular expressions), but you do need to use "lookahead" and "lookbehind" to match some stuff without including it in your result.
The two lines you need are:
string.scan(/(?<=@\[)\d+(?=\])/).map(&:to_i) # => [2468, 1357]
string.scan(/(?<=#\[)\d+(?=\])/).map(&:to_i) # => [1111, 2321, 1212]
The (?<=...) creates a "positive lookbehind" which ensures that the preceding characters match ..., but those characters aren't included in the matched text. In other words, (?<=#\[) requires "#[" immediately before the digits, but "#[" will not be included in the results returned by string.scan.
Notice that the opening and closing square brackets have a backslash in front of them. This is because square brackets have special meaning in a regular expression (they create a "character class"), but since we want to match a literal square bracket, we must "escape" them with a backslash.
\d+ means to match 1 or more digits.
(?=...) creates a "positive lookahead" which ensures that the following characters match ..., but those characters aren't included in the matched text. Same as the lookbehind above, but checks the following characters instead of the preceding characters. In this case, (?=\]) matches "]" without including the "]" in the results returned by string.scan.
string.scan will return an array of strings. The .map(&:to_i) part will run string.to_i on each string to return an actual integer value.
string.scan(/(?<=@\[)[^\]]*(?=\])/) # => ["2468", "1357"]
string.scan(/(?<=#\[)[^\]]*(?=\])/) # => ["1111", "2321", "1212"]
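The two scans differ only in the marker character, so they can share one small helper. This is a sketch: extract_ids is a hypothetical name, and it assumes user ids are written as @[...] and hashtag ids as #[...].

```ruby
# Hypothetical helper: returns the integers found inside "<marker>[...]".
def extract_ids(text, marker)
  # Regexp.escape guards against markers that are special in a regex.
  pattern = /(?<=#{Regexp.escape(marker)}\[)\d+(?=\])/
  text.scan(pattern).map(&:to_i)
end

text = 'I had a great time with @[2468] and @[1357]! #[1111] #[2321]#[1212]'

user_ids     = extract_ids(text, "@")
hash_tag_ids = extract_ids(text, "#")

p user_ids      # [2468, 1357]
p hash_tag_ids  # [1111, 2321, 1212]
```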

Difference between \b and \s in Regular Expression

I was learning regular expressions in iOS and saw this tutorial: http://www.raywenderlich.com/30288/nsregularexpression-tutorial-and-cheat-sheet
It reads like this for \b:
\b matches word boundary characters such as spaces and punctuation. to\b will match the "to" in "to the moon" and "to!", but it will not match "tomorrow". \b is handy for "whole word" type matching.
and \s:
\s matches whitespace characters such as spaces, tabs, and newlines. hello\s will match "hello " in "Well, hello there!".
I have two questions on this:
1) what is the difference between \s and \b? when to use which?
2) \b is handy for "whole word" type matching -> Don't understand the meaning..
Need some guidance on these two.
\b Boundary characters
\b matches the boundary itself but not the boundary character (like a comma or period). It has no length of its own, but it can be used, for example, to find an e at the end of a word.
For example in the sentence: "Hello there, this is one test. Testing"
The regex e\b will match an e if it's at the end of a word (followed by a word boundary). Notice that the e in "test" and "Testing" doesn't match, since that e is not followed by a boundary.
\s Whitespace
\s on the other hand matches the actual white space characters (like spaces and tabs). In the same sentence it will match all the spaces between the words.
Edit
Since \b doesn't make much sense alone, I showed how to use it as e\b (above). The OP asked (in a comment) about what e\s would match compared to e\b, to better explain the difference between \b and \s.
In the same string there is only one match for e\s, while there were two matches for e\b, since the comma is not whitespace. Note that the e\s match includes the whitespace, whereas the e\b match doesn't.
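The same comparison can be reproduced in code. The sketch below uses Ruby's String#scan on the example sentence (the variable names are illustrative):

```ruby
sentence = "Hello there, this is one test. Testing"

# e followed by a word boundary: the final e of "there" and of "one".
boundary_matches = sentence.scan(/e\b/)
p boundary_matches     # ["e", "e"]

# e followed by whitespace: only "one " qualifies, and the space is included.
whitespace_matches = sentence.scan(/e\s/)
p whitespace_matches   # ["e "]
```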
\b matches a word boundary. It is a zero-width assertion, meaning it does not match a character; it matches a position where a certain condition is true.
\b is related to \w. \w defines the "word characters": letters, digits and underscores. \b matches at a change from a word character to a non-word character, or the other way round. That means it matches at the start and end of a word, but not the character before or after the word.
\s is a predefined character class that is matching any whitespace character.
See and try out what \bFoo\b matches here on Regexr
See and try out what \sFoo\s matches here on Regexr
\b is zero-width. That is, it doesn't actually match any character. Meanwhile, \s does match a character. This is an important distinction for capturing and more complicated regular expressions.
For example, say you're trying to match numbers that begin with multiple zeros, like 007 or 000101101. You might try:
0+\d*
But see, that would also match 1007 and 101000101101! So then, you might try:
\s0+\d*
But see how that wouldn't match a 007 at the beginning of the string (because there's no space character)? Using \b allows you to get the "whole word (or number)":
\b0+\d*
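A short sketch comparing the three attempts on one input string (the input and variable names are made up for illustration):

```ruby
input = "007 1007 000101101"

# No anchor: also picks up the "007" inside "1007".
unanchored = input.scan(/0+\d*/)
p unanchored     # ["007", "007", "000101101"]

# Leading \s: misses "007" at the start and includes the space in the match.
with_space = input.scan(/\s0+\d*/)
p with_space     # [" 000101101"]

# Leading \b: only whole numbers that start with a zero, nothing extra consumed.
with_boundary = input.scan(/\b0+\d*/)
p with_boundary  # ["007", "000101101"]
```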
\b matches the position between a word character (letter, digit, or underscore) and a non-word character, without including anything in the match.
\s matches only whitespace.
For example:
\b would match between a word character and any of these: "!?,.@#$%^&*()+ ".
$text = "Hello, Yo! moo .";
$regex = "~o\b~";
^---Will match three o's: the last o of "Hello", "Yo", and "moo".
$text = "Hello, Yo! moo .";
$regex = "~o\s~";
^---Will only match the 'o' in 'moo'.
