Difference in regex between Rails and Ruby? - ruby-on-rails

I am attempting to match a value like 'MN+WI' at the end of a URL, for example /foos/MN+WI. The pattern [a-zA-Z][\+\,]? produces a match result of MN+WI on rubular.com, but in IRB:
s="MI+WI"
p="[a-zA-Z]{2}[\+\,]?"
r=Regexp.new(p)
r.match(s) # => #<MatchData "MI+">
The behavior in Ruby console is consistent with what I am encountering with Rails. Is there a difference between the two? How do I need to adjust my regex pattern?
$ ruby -v
ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.3.0]
$ rails -v
Rails 4.0.0
** edit **
Original pattern should have been [a-zA-Z]{2}[\+\,]?.
What I really need is to have a route recognize any of these variations and assign it to a param:
MN (working)
mn (working)
MN+WI (not working)
MN+WI+IA (arbitrary number of 2-letter value, separated by a +)
not match single or more than 2-letter values (e.g. ABC), but keep 2-letter values (e.g. ABC+MN; keep MN)

As I said in my comment, [a-zA-Z][\+\,]? does not match MN+WI. What you are seeing on Rubular is actually two matches. The first match is MN+, and the second match WI. Rubular just highlights all the matches, so it looks like one long match but it is actually two matches. The behavior should be consistent between Rubular and your local Ruby install.

Your regexp means "2 letters followed by an optional + or ,". So your string has 2 matches. Rubular highlights all matches, so it looks like the whole string is matched, but in reality there are 2 separate matches: MN+ and WI.

Rubular is showing the result of the repeated application of the pattern:
[a-zA-Z][\+\,]?
If you put that pattern in a capture group, you'll see each of the individual matches (see http://rubular.com/r/h5iBa5k0fr), each of which matches a single character except for N+.
Your IRB code returns a single match. Note also, though, that your IRB code is different than the above regex due to your inclusion of {2}.
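The difference between Rubular's highlighting and a single match call can be reproduced in IRB: String#scan returns every match, the way Rubular displays them, while match returns only the first. The sketch below uses the corrected {2} pattern. The helper names are made up for illustration, and state_list? is only an assumed example of anchoring the whole value; it is not from the question or the answers, and it rejects ABC+MN outright rather than keeping MN.

```ruby
# Pattern from the question, with the {2} correction from the edit.
def first_match(s)
  m = /[a-zA-Z]{2}[+,]?/.match(s)
  m && m[0] # match stops at the first match
end

def all_matches(s)
  s.scan(/[a-zA-Z]{2}[+,]?/) # scan collects every non-overlapping match
end

# Hypothetical helper: accepts one or more 2-letter values joined by "+",
# and nothing else, by anchoring with \A and \z.
def state_list?(s)
  !!(s =~ /\A[a-zA-Z]{2}(?:\+[a-zA-Z]{2})*\z/)
end

p first_match("MN+WI")
p all_matches("MN+WI")
p %w[MN mn MN+WI MN+WI+IA ABC].map { |v| [v, state_list?(v)] }
```

Whether a route constraint should use a pattern like this depends on how the + arrives in the param, which the thread does not settle.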

Related

Ruby .scan method returns empty using regex

So given a string like this "\"turkey AND ham\" NOT \"roast beef\"" I need to get an array with the inner strings like so: ["turkey AND ham", "roast beef"] and eliminate OR's, AND's and NOT's that may or may not be there.
With the help of Rubular I came up with this regex /\\["']([^"']*)\\["']/
which returns the following 2 groups:
Match 1
1. turkey AND ham
Match 2
1. roast beef
However, when I use it with .scan I keep getting an empty array.
I looked at several other SO posts, but cannot figure out where I am going wrong.
Here is the result from my rails console:
=> q = "\"turkey and ham\" OR \"roast beef\""
=> q.scan(/\\["']([^"']*)\\["']/)
=> []
Expectation:
["turkey AND ham", "roast beef"]
I shall also mention I suck at regex.
When the regex used with scan contains a capture group (@davidhu2000's approach), one generally can use lookarounds¹ instead. It's just a matter of personal preference. To allow for double-quoted strings that contain either single- or (escaped) double-quoted strings, you could use the following regex.
r = /
(?<=") # match a double quote in a positive lookbehind
[^"]+ # match one or more characters that are not double-quotes
(?=") # match a double quote in a positive lookahead
| # or
(?<=') # match a single quote in a positive lookbehind
[^']+ # match one or more characters that are not single-quotes
(?=') # match a single quote in a positive lookahead
/x # free-spacing regex definition mode
"\"turkey AND ham\" NOT 'roast beef'".scan(r)
#=> ["turkey AND ham", "roast beef"]
Since '"turkey AND ham" NOT "roast beef"' evaluates to "\"turkey AND ham\" NOT \"roast beef\"" (i.e., that is how the single-quoted string is stored), we need not treat it as an additional case to deal with.
¹ For any in the audience who still consider regular expressions to be black magic, there are four kinds of lookarounds (positive and negative lookbehinds and lookaheads) as elaborated in the doc for Regexp. Sometimes they are regarded as "zero-width" matches as they are not part of the matched text.
Your regex is trying to match a literal \, which won't match anything in the string: the \ is only there to escape the double quote in the string literal and is not part of the string itself.
So if you remove the \\ from your regex:
res = q.scan(/["']([^"']*)["']/)
This will return a 2d array
res = [["turkey and ham"], ["roast beef"]]
Each inner array is all the matching groups from the regex, so if you have two capture groups in your regex, you will see two items in the inner array.
If you want a flat array, you can call the flatten method on the result.
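Putting the pieces of this answer together, a minimal sketch (the method name quoted_phrases is made up for illustration) shows why the original pattern returns nothing and what scan plus flatten returns once the backslashes are removed:

```ruby
# Hypothetical helper: returns the text between each pair of quotes.
def quoted_phrases(query)
  # scan returns one inner array per match when the regex has a capture
  # group, so flatten turns [["a"], ["b"]] into ["a", "b"].
  query.scan(/["']([^"']*)["']/).flatten
end

q = "\"turkey and ham\" OR \"roast beef\""

# The string contains no backslash characters, so a pattern that
# requires a literal backslash before each quote finds nothing.
p q.scan(/\\["']([^"']*)\\["']/)

p quoted_phrases(q)
```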

Why is my expression matching something that clearly doesn't meet the expression?

I'm using Rails 4.2.7. I want to match the pattern, numbers, an arbitrary number of spaces, and a potential "+" at the end. So I wrote
2.3.0 :013 > /\d+\s*\+?/.match("40+")
=> #<MatchData "40+">
However, this is also matching
2.3.0 :012 > /\d+\s*\+?/.match("40-50")
=> #<MatchData "40">
What gives? The string "40-50" doesn't match the expression provided, but clearly I'm not doing something right in my regex.
If you want it to match only the full string, use the \A and \z anchors. Like this:
/\A\d+\s*\+?\z/.match("40-50")
This forces the regular expression to match only if the full string matches it.
Otherwise, the way you have it now, it will stop as soon as it finds a match anywhere in your string.
\d+ means one digit or more.
\s* means zero or more whitespace characters.
\+? means an optional + sign.
So 40 does match those conditions.
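The effect of anchoring can be checked in IRB. This is a minimal sketch, assuming the intended pattern keeps the optional + (\+?); the helper names loose_match and strict_match are made up for illustration.

```ruby
# Unanchored: the regex may match any substring of the input.
def loose_match(s)
  m = /\d+\s*\+?/.match(s)
  m && m[0]
end

# Anchored with \A and \z: the whole string must fit the pattern.
def strict_match(s)
  m = /\A\d+\s*\+?\z/.match(s)
  m && m[0]
end

p loose_match("40-50")   # finds the leading digits and stops there
p strict_match("40-50")  # nil, because "-50" is left over
p strict_match("40+")
p strict_match("40  +")
```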
The ? Quantifier is Your Problem
You want the following instead:
/\d+\s*\+/.match("40-50")
If you use \+? instead of a bare \+, the question mark changes the meaning to "zero or one of the preceding atom." With the quantifier, the pattern still matches the 40 in 40-50, since zero + characters after the digits is acceptable.

iOS regex pattern not working to match numbers

I want to match strings of the forms:
123
.123
1.123
and I am using the following string for my regex
@"^\\d*(?:\\.\\d+)?$"
However, it matches strings of the following forms as well
1.2.3
1..2..3
123...
What's wrong with my regex? I used the ^ and $ because I don't want the string to contain anything other than the number forms mentioned.
EDIT:
I logged what is matched in the string like 78..7 and found that the match location is 0 and length is 0 with a result of "" being matched. Any ideas? Shouldn't the range location be NSNotFound if the length is 0? I suppose the regex expression is fine then and I can just check for !length but that seems like an unnecessary work around.
Try this regex:
^(?<!\.)\d*(\.\d+)?$
I added a negative look-behind assertion, which means that no dot is allowed before the number. That should fix your problem.
Description
This regex will find valid positive real numbers with or without a decimal point, such as 123, .123, and 1.123. The expression can be applied to a string where each value tested is on its own line, or used to find numbers in the middle of a block of text. It also allows punctuation such as periods and commas directly after the number but won't capture them.
(?<=^|\s)\d*\.?\d+(?=[,.;]?(?:\s|$))
Given Input String:
1.2.3
1..2..3
128...
1234
.123
1.123
1...23
1.2.3
123...
I like kittens 345.23, and version 2.3.4 dogs
Matches are:
1234
.123
1.123
345.23
Does this work for you?
@"^\\d*\\.?\\d+$"
Here it is without escaped backslashes:
^\d*\.?\d+$
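The logic of this pattern can be sanity-checked outside of iOS. The sketch below uses Ruby, not NSRegularExpression, so it only tests the pattern itself and says nothing about how the iOS API reports matches. Because ^ and $ are line anchors in Ruby, the sketch swaps in \A and \z to require the whole string; the helper name plain_number? is made up for illustration.

```ruby
# Hypothetical helper: true when the whole string is digits with at most
# one decimal point that is followed by at least one digit.
def plain_number?(s)
  !!(s =~ /\A\d*\.?\d+\z/)
end

%w[123 .123 1.123 1.2.3 1..2..3 123... 78..7].each do |candidate|
  puts "#{candidate.inspect}: #{plain_number?(candidate)}"
end
```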
My best guess is that rekire is right about the $ symbol not working. If that's the case, then the regex does actually match the empty substring at the start of the string, which explains why it says it's found a match of length 0 at location 0, instead of NSNotFound.
This regex matches your strings (note the escaped dot, so that it matches only a literal period):
[0-9]*(\.){0,1}[0-9]+

Retaining the pattern characters while splitting via Regex, Ruby

I have the following string
str="HelloWorld How areYou I AmFine"
I want this string into the following array
["Hello","World How are","You I Am", "Fine"]
I have been using the following regex. It splits in the right places, but it also drops the characters matched by the pattern, and I want to retain them.
What i get is
str.split(/[a-z][A-Z]/)
=> ["Hell", "orld How ar", "ou I A", "ine"]
It omits the matched characters.
Can anyone help me retain these characters in the resulting array?
In Ruby 1.9 you can use positive lookahead and positive lookbehind (lookahead and lookbehind regex constructs are also called zero-width assertions). They check the surrounding characters but do not consume them, so you won't lose your border characters:
str.split /(?<=[a-z])(?=[A-Z])/
=> ["Hello", "World How are", "You I Am", "Fine"]
Ruby 1.8 does not support lookbehind constructs (only lookahead), so the split above will not work there. I recommend using Ruby 1.9 if possible.
If you are forced to use ruby 1.8.7, I think regex won't help you and the best solution I can think of is to build a simple state machine: iterate over each character in your original string and build first string until you encounter border condition. Then build second string etc.
Three answers so far, each with a limitation: one is Rails-only and breaks with an underscore in the original string, another is Ruby 1.9 only, and the third always has a potential error with its special character. I really liked the split on zero-width assertion answer from @Alex Kliuchnikau, but the OP needs Ruby 1.8, which doesn't support lookbehind. Here's an answer that uses only zero-width lookahead and works fine in 1.8 and 1.9, using String#scan instead of #split.
str.scan /.*?[a-z](?=[A-Z]|$)/
=> ["Hello", "World How are", "You I Am", "Fine"]
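The two approaches can be compared side by side. A minimal sketch, with helper names made up for illustration; the first method needs a Ruby version with lookbehind support (1.9 or later), while the second relies on lookahead only.

```ruby
# Split at the empty position between a lowercase and an uppercase letter.
# The lookbehind and lookahead consume nothing, so no characters are lost.
def split_on_case_border(str)
  str.split(/(?<=[a-z])(?=[A-Z])/)
end

# Lookahead-only alternative: lazily collect text up to a lowercase letter
# that is followed by an uppercase letter or the end of the line.
def scan_to_case_border(str)
  str.scan(/.*?[a-z](?=[A-Z]|$)/)
end

str = "HelloWorld How areYou I AmFine"
p split_on_case_border(str)
p scan_to_case_border(str)
```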
I think this will do the job for you (underscore comes from ActiveSupport, so this needs Rails):
str.underscore.split(/_/).each do |s|
  s.capitalize!
end

Assistance with Some Interesting Syntax in Some Ruby Code I've Found

I'm currently reading Agile Web Development With Rails, 3rd edition. On page 672, I came across this method:
def capitalize_words(string)
  string.gsub(/\b\w/) { $&.upcase }
end
What is the code in the block doing? I have never seen that syntax. Is it similar to the array.map(&:some_method) syntax?
It's Title Casing The Input. Inside the block, $& is a built-in variable holding the current match (\b\w, i.e. the first letter of each word), which is then uppercased.
You've touched on one of the few things I don't like about Ruby :)
The magic variable $& contains the matched string from the previous successful pattern match. So in this case, it'll be the first character of each word.
This is mentioned in the RDoc for String.gsub:
http://ruby-doc.org/core/classes/String.html#M000817
gsub replaces everything that matched the regex with the result of the block. So yes, in this case you're matching the first letter of each word, then replacing it with the upcased version.
As to the slightly bizarre syntax inside the block, this is equivalent (and perhaps easier to understand):
def capitalize_words(string)
  string.gsub(/\b\w/) { |x| x.upcase }
end
or even slicker:
def capitalize_words(string)
  string.gsub(/\b\w/, &:upcase)
end
As to the regex (courtesy of the pickaxe book), \b matches a word boundary, and \w any 'word character' (alphanumerics and underscore). So \b\w matches the first character of each word.
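The three spellings discussed in this thread can be run next to each other to confirm they behave the same. A small sketch; the method names are made up so that all three can coexist in one file.

```ruby
# $& holds the text of the most recent successful match, which inside
# the gsub block is the character currently being replaced.
def capitalize_with_global(string)
  string.gsub(/\b\w/) { $&.upcase }
end

# gsub also yields the matched text as a block parameter.
def capitalize_with_param(string)
  string.gsub(/\b\w/) { |match| match.upcase }
end

# &:upcase converts the symbol into a block that calls upcase on the match.
def capitalize_with_symbol(string)
  string.gsub(/\b\w/, &:upcase)
end

title = "agile web development with rails"
p capitalize_with_global(title)
p capitalize_with_param(title)
p capitalize_with_symbol(title)
```

One caveat of \b\w: an apostrophe is not a word character, so the letter after it also sits at a word boundary and "don't" becomes "Don'T".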

Resources