Hey... how would you validate a full_name field (name surname)?
Consider names like:
Ms. Jan Levinson-Gould
Dr. Martin Luther King, Jr.
Brett d'Arras-d'Haudracey
Brüno
Instead of validating the characters that are there, you might just want to ensure some set of characters is not present.
For example:
class User < ActiveRecord::Base
validates_format_of :full_name, :with => /\A[^0-9`!@#\$%\^&*+_=]+\z/
# add any other characters you'd like to disallow inside the [ brackets ]
# metacharacters [, \, ^, $, ., |, ?, *, +, (, and ) need to be escaped with a \
end
Tests
Ms. Jan Levinson-Gould # pass
Dr. Martin Luther King, Jr. # pass
Brett d'Arras-d'Haudracey # pass
Brüno # pass
John Doe # pass
Mary-Jo Jane Sally Smith # pass
Fatty Mc.Error$ # fail
FA!L # fail
@arold Newm@n # fail
N4m3 w1th Numb3r5 # fail
Regular expression explanation
NODE                     EXPLANATION
--------------------------------------------------------------------------------
\A                       the beginning of the string
--------------------------------------------------------------------------------
[^`!@#\$%\^&*+_=\d]+     any character except: '`', '!', '@', '#',
                         '\$', '%', '\^', '&', '*', '+', '_', '=',
                         digits (0-9) (1 or more times (matching
                         the most amount possible))
--------------------------------------------------------------------------------
\z                       the end of the string
Any validation you perform here is likely to break down unless it is extremely general. For instance, enforcing a minimum length of 3 is probably about as reasonable as you can get without getting into the specifics of what is entered.
When you have names like "O'Malley" with an apostrophe, "Smith-Johnson" with a dash, "Andrés" with accented characters or extremely short names such as "Vo Ly" with virtually no characters at all, how do you validate without excluding legitimate cases? It's not easy.
At least one space and at least 4 characters (including the space):
\A(?=.* )[^0-9`!@#\\\$%\^&*\;+_=]{4,}\z
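As a quick check, the pattern above can be tried in plain Ruby outside of Rails. This is only a sketch: the constant name NAME_WITH_SPACE and the sample names are made up for illustration, and the character class is written with @ and # as its first two symbols, which is an assumption.

```ruby
# Hypothetical constant name. The lookahead (?=.* ) requires at least one
# space; {4,} requires at least 4 characters, none of them blacklisted.
# The "@#" at the start of the blacklist is an assumption.
NAME_WITH_SPACE = /\A(?=.* )[^0-9`!@#\\\$%\^&*;+_=]{4,}\z/

samples = ["John Doe", "Vo Ly", "Brüno", "A B", "N4m3 w1th Numb3r5"]

# Map each sample name to true (accepted) or false (rejected).
results = samples.map { |name| [name, name.match?(NAME_WITH_SPACE)] }.to_h

results.each { |name, ok| puts "#{name.inspect}: #{ok ? 'pass' : 'fail'}" }
```

Note that a single-word name such as "Brüno" is rejected by this stricter pattern, which is exactly the kind of legitimate case the previous paragraphs warn about.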
Related
I need to check username format using a regular expression.
My username criterion is:
Must contain 1 or more letters, anywhere.
May contain any amount of numbers, anywhere.
Can contain up to 2 - or _, anywhere.
^[0-9\-_]*[a-z|A-Z]+[0-9\-_]*$ is what I was using but this will reject usernames such as 123hi123hi, or hi123hi. I need something less string location dependent.
I'm using Ruby on Rails to match strings against this.
A very inefficient Ruby function version for Rails is:
validate :check_username
def check_username
if self.username.count("-") > 2
errors.add(:username, "cannot contain more than 2 dashes")
elsif self.username.count("_") > 2
errors.add(:username, "cannot contain more than 2 underscores")
elsif self.username.count("a-zA-Z") < 1
errors.add(:username, "must contain a letter")
elsif (self.username =~ /^[0-9a-zA-Z\-_]+$/) !=0
errors.add(:username, "cannot contain special characters")
end
end
Here are two approaches you could use.
Construct a single regex
Because regular expressions are concerned with the ordering of characters in a string, one would have to construct a regular expression for each of the following combinations and then "or" those regexes into a single regex.
one letter, zero hyphens, zero underscores
one letter, zero hyphens, one underscore
one letter, zero hyphens, two underscores
one letter, one hyphen, zero underscores
one letter, one hyphen, one underscore
one letter, one hyphen, two underscores
one letter, two hyphens, zero underscores
one letter, two hyphens, one underscore
one letter, two hyphens, two underscores
Digits and additional letters could appear anywhere in the username.
Let's call those regular expressions t0, t1,..., t8. The desired single, overall regular expression would be:
/#{t0}|#{t1}|...|#{t8}/
Let's consider the construction of t4 (one letter, one hyphen, one underscore).
Six orders are possible for this combination.
a letter, a hyphen, an underscore
a letter, an underscore, a hyphen
a hyphen, a letter, an underscore
a hyphen, an underscore, a letter
an underscore, a letter, a hyphen
an underscore, a hyphen, a letter
We would need to construct a regular expression for each of these six orders (r0, r1,..., r5) and then "or" them to obtain t4:
t4 = /#{r0}|#{r1}|#{r2}|#{r3}|#{r4}|#{r5}/
Now let's consider the construction of a regex r0 that would implement the first of these orderings (a letter, a hyphen, an underscore):
r0 = /\A[a-z0-9]*[a-z][a-z0-9]*-[a-z0-9]*_[a-z0-9]*\z/i
"3ab4-3cd_e5".match?(r0) #=> true
"3ab4-3cde5".match?(r0) #=> false (no underscore)
"34-3cd_e5".match?(r0) #=> false (no letter before the hyphen)
"3ab4_3cd-e5".match?(r0) #=> false (underscore precedes hyphen)
Construction of each of the other five ri's would be similar.
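To get a feel for how much work this is, here is a rough sketch that generates the six ri's for t4 from permutations instead of writing them by hand. The names FILLER, TOKENS and orderings are made up for this illustration, and the six alternatives are joined inside one anchored group rather than "or"-ing six separately anchored regexes, which should be equivalent.

```ruby
FILLER = '[a-z0-9]*'          # digits and extra letters may appear anywhere
TOKENS = ['[a-z]', '-', '_']  # the required letter, one hyphen, one underscore

# One pattern source per ordering of the three tokens (3! = 6 orderings).
orderings = TOKENS.permutation.map do |perm|
  FILLER + perm.map { |tok| tok + FILLER }.join
end

# "Or" the six orderings together, anchored to the whole string.
t4 = /\A(?:#{orderings.join('|')})\z/i

puts orderings.size               # number of hand-crafted regexes avoided
puts t4.match?("3ab4-3cd_e5")     # letter, hyphen, underscore
puts t4.match?("3ab4_3cd-e5")     # letter, underscore, hyphen
puts t4.match?("3ab4-3cde5")      # no underscore, so not t4
```

Even with the generation automated, this only covers one of the nine combinations.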
We would then need to compute ti for each of the eight combinations other than the fifth one. t0 (one letter, zero hyphens, zero underscores) is easy:
t0 = /\A[a-z0-9]*[a-z][a-z0-9]*\z/i
By contrast, t8 (one letter, two hyphens, two underscores) would be a much longer regex than t4 (considered above), as a regular expression would have to be hand-crafted for each of 5!/(2!*2!) #=> 30 orderings (r0, r1,..., r29).
It should now be obvious that the use of a single regular expression is simply not the right tool for validating usernames.
Do not construct a single regex
def username_valid?(username)
cnt = username.each_char.with_object(Hash.new(0)) do |c,cnt|
case c
when /\d/
when /[[:alpha:]]/
cnt[:letter] += 1
when '-'
cnt[:hyphen] += 1
when '_'
cnt[:underscore] += 1
else
return false
end
end
cnt.fetch(:letter, 0) > 0 && cnt.fetch(:hyphen, 0) <= 2 &&
cnt.fetch(:underscore, 0) <= 2
end
username_valid? "Bob" #=> true
username_valid? "Bob1_23_-" #=> true
username_valid? "z" #=> true
username_valid? "123--_" #=> false (no letters)
username_valid? "Melba1-23--_" #=> false (3 hyphens)
username_valid? "Bob1_23_-$" #=> false ($ not permitted)
A hash created by Hash::new with an argument of zero (the default value) is often called a counting hash. If h is such a hash and has no key k, h[k] returns the default value. So h[k] += 1 is evaluated thusly:
h[k] += 1
#=> h[k] = h[k] + 1
#=> h[k] = 0 + 1
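A tiny standalone illustration of the counting hash, separate from the username method (the variable name counts and the sample string are arbitrary):

```ruby
counts = Hash.new(0)   # missing keys read as 0 instead of nil

"a-b_c-".each_char { |c| counts[c] += 1 }

puts counts["-"]       # hyphens seen in the sample string
puts counts["z"]       # never seen, so the default value is returned
puts counts.key?("z")  # reading a missing key does not add it to the hash
```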
The method could instead be written to return false as soon as it determines that the username is invalid.
def username_valid?(username)
cnt = username.each_char.with_object(Hash.new(0)) do |c,cnt|
case c
when /\d/
when /[[:alpha:]]/
cnt[:letter] += 1
when '-'
return false if cnt[:hyphen] == 2
cnt[:hyphen] += 1
when '_'
return false if cnt[:underscore] == 2
cnt[:underscore] += 1
else
return false
end
end
cnt.fetch(:letter, 0) > 0
end
This is a bad use for a regular expression because your data isn't structured enough. Instead, a small series of simple tests will tell you what you need to know:
def valid?(str)
str[/[a-z]/i] && str.tr('^-_', '').size <= 2
end
%w(123hi123hi hi123hi).each do |username|
username # => "123hi123hi", "hi123hi"
valid?(username) # => true, true
end
There is a loss of speed due to the use of the regular expression
/[a-z]/i
so instead
def valid?(str)
str.downcase.tr('^a-z', '').size > 0 && str.tr('^-_', '').size <= 2
end
could be used. The use of the regular expression is about 45% slower based on testing.
Breaking it down:
str[/[a-z]/i] tests for at least one letter. Since there can be more than one, this will suffice.
str.downcase.tr('^a-z', '').size converts the characters to lowercase, then strips all non-letter characters, resulting in only letters remaining, then counts how many there are:
'123hi123hi'.downcase # => "123hi123hi"
.tr('^a-z', '') # => "hihi"
.size # => 4
'hi123hi'.downcase # => "hi123hi"
.tr('^a-z', '') # => "hihi"
.size # => 4
'hi-123_hi'.downcase # => "hi-123_hi"
.tr('^a-z', '') # => "hihi"
.size # => 4
The rule
May contain any amount of numbers, anywhere
isn't worth testing so I ignored it.
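For what it's worth, String#count accepts the same kind of character-set argument as tr, so the two tests could also be sketched without building intermediate strings. The method name username_ok? is made up here, and like the versions above it does not reject other special characters:

```ruby
# Hypothetical helper: at least one ASCII letter, and at most two
# hyphens/underscores in total (counted together, as in the tr version).
def username_ok?(str)
  str.count('a-zA-Z') > 0 && str.count('-_') <= 2
end

%w(123hi123hi hi123hi hi-1_2 a-b_c-d 123).each do |username|
  puts "#{username}: #{username_ok?(username)}"
end
```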
This is an improved version of your regex:
^[\w-]*[A-Za-z]+[\w-]*$
But it does not count how many - or _ characters there are, so you will need another regex to filter on that, or count them manually in code.
This regex checks for two or fewer [-_] characters, regardless of position:
^[A-Za-z\d]*[-_]{0,1}[A-Za-z\d]*[-_]{0,1}[A-Za-z\d]*$
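A sketch of how the two checks might be combined in Ruby. The constant and method names are invented for this example, \A and \z are used instead of ^ and $ so that multi-line input cannot slip through, and {0,1} is written as the equivalent ?:

```ruby
# At most two separators (- or _) in total, everything else alphanumeric.
AT_MOST_TWO_SEPARATORS = /\A[A-Za-z\d]*[-_]?[A-Za-z\d]*[-_]?[A-Za-z\d]*\z/
# At least one letter somewhere in the string.
HAS_LETTER = /[A-Za-z]/

def username_format_ok?(name)
  name.match?(AT_MOST_TWO_SEPARATORS) && name.match?(HAS_LETTER)
end

%w(hi123hi a-b_c a--b a-b_c-d 12-3).each do |name|
  puts "#{name}: #{username_format_ok?(name)}"
end
```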
If you're allowing only letters, numbers, dashes and underscores,
and everything else is considered a special character,
I think it's only that the pattern you have needs negation.
Instead of (self.username =~ /^[0-9a-zA-Z\-_]+$/) !=0
try (self.username =~ /^[^0-9a-zA-Z\-_]+$/) !=0
or (self.username =~ /^[\W-]+$/) > 0.
Also, why not do a count for special characters, like in the conditions above this one?
I'm trying to display an array of words from a user's post. However the method I'm using treats an apostrophe like whitespace.
<%= var = Post.pluck(:body) %>
<%= var.join.downcase.split(/\W+/) %>
So if the input text was: The baby's foot
it would output the baby s foot,
but it should be the baby's foot.
How do I accomplish that?
The accepted answer is too naïve:
▶ "It’s naïve approach".split(/[^'\w]+/)
#⇒ [
# [0] "It",
# [1] "s",
# [2] "nai",
# [3] "ve",
# [4] "approach"
# ]
This is because it is almost 2016 and many users want to use their real names, like, you know, José Østergaard. And the apostrophe is not the only punctuation mark, as you might notice.
▶ "It’s naïve approach".split(/[^'’\p{L}\p{M}]+/)
#⇒ [
# [0] "It’s",
# [1] "naïve",
# [2] "approach"
# ]
Further reading: Character Properties.
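The same character properties can also be used the other way round, scanning for words instead of splitting on non-words. A minimal sketch, assuming a UTF-8 source file; the constant name WORD and the sample strings are made up:

```ruby
# A "word" here is a run of letters (\p{L}), combining marks (\p{M}),
# and straight or typographic apostrophes.
WORD = /[\p{L}\p{M}'’]+/

["The baby's foot", "It’s naïve approach", "José Østergaard’s post"].each do |text|
  p text.scan(WORD)
end
```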
Along the lines of mudasobwa's answer, here's what \w and \W bring to the party:
chars = [*' ' .. "\x7e"].join
# => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
That's the usual visible lower-ASCII characters we'd see in code. See the Regexp documentation for more information.
Grabbing the characters that match \w returns:
chars.scan(/\w+/)
# => ["0123456789",
# "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
# "_",
# "abcdefghijklmnopqrstuvwxyz"]
Conversely, grabbing the characters that don't match \w, or that match \W:
chars.scan(/\W+/)
# => [" !\"\#$%&'()*+,-./", ":;<=>?@", "[\\]^", "`", "{|}~"]
\w is defined as [a-zA-Z0-9_], which is not what we would normally call "word" characters. Instead, they're typically the characters we use to define variable names.
If you're dealing with only lower-ASCII characters, use the character-class
[a-zA-Z]
For instance:
chars = [*' ' .. "\x7e"].join
lower_ascii_chars = '[a-zA-Z]'
not_lower_ascii_chars = '[^a-zA-Z]'
chars.scan(/#{lower_ascii_chars}+/)
# => ["ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"]
chars.scan(/#{not_lower_ascii_chars}+/)
# => [" !\"\#$%&'()*+,-./0123456789:;<=>?@", "[\\]^_`", "{|}~"]
Instead of defining your own, you can take advantage of the POSIX definitions and character properties:
chars.scan(/[[:alpha:]]+/)
# => ["ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"]
chars.scan(/\p{Alpha}+/)
# => ["ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"]
Regular expressions always seem like a wonderful new wand to wave when extracting information from a string, but, like the Sorcerer's Apprentice found out, they can create havoc when misused or not understood.
Knowing this should help you write a bit more intelligent patterns. Apply that to what the documentation shows and you should be able to easily figure out a pattern that does what you want.
You can use the regex below instead of /\W+/:
var.join.downcase.split(/[^'\w]+/)
/\W/ matches any non-word character, and the apostrophe is one such character.
To keep the code as close as possible to the original intent, we can use /[^'\w]/, which matches any character that is neither an apostrophe nor a word character.
Running that string through irb with the same split call that you wrote in your comment gets this:
irb(main):008:0> "The baby's foot".split(/\W+/)
=> ["The", "baby", "s", "foot"]
However, if you use split without an explicit delimiter, you get the split you're looking for:
irb(main):009:0> "The baby's foot".split
=> ["The", "baby's", "foot"]
Does that get you what you're looking for?
I need to verify that a string has at least one comma but not more than 4 commas.
This is what I've tried:
/,{1,4}/
/,\s{1,4}/
Neither of those work.
Note: I've been testing my RegEx's on Rubular
Any help is greatly appreciated.
Note: I'm using this in the context of an Active Record Validation:
validates :my_string, format: { with: /,\s{1,4}/}
How can I do this as an Active Record Validation?
Does it have to be a regex? If not, use Ruby's count method:
> "a,a,a,a,a".count(',')
=> 4
str ="a,b,a,,"
p str.count(",").between?(1, 4) # => true
I too would suggest using count, but to address your specific question, you could do it thusly:
r = /^(?:[^,]*,){1,4}[^,]*$/
!!"eenee"[r]
#=> false
!!"eenee, meenee"[r]
#=> true
!!"eenee, meenee, minee, mo"[r]
#=> true
!!"eenee, meenee, minee, mo, oh, no!"[r]
#=> false
(?:[^,]*,) is a non-capture group that matches any string of characters other than a comma, followed by a comma;
{1,4} ensures that the non-capture group is matched between 1 and 4 times;
the anchor ^ ensures there is no comma before the first non-capture group; and
[^,]*$ ensures there is no comma after the last non-capture group.
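To compare the two suggestions side by side, here is a small sketch in plain Ruby. The names COMMA_RE and comma_count_ok? are invented for the example. The regex uses \A and \z rather than ^ and $, since those match at line boundaries in Ruby and, as far as I know, Rails' format validator rejects them unless multiline: true is passed. The count-based check could be called from a custom validation method.

```ruby
# Between 1 and 4 commas anywhere in the whole string.
COMMA_RE = /\A(?:[^,]*,){1,4}[^,]*\z/

# Hypothetical helper doing the same check with String#count.
def comma_count_ok?(str)
  str.count(',').between?(1, 4)
end

samples = ["eenee", "eenee, meenee", "eenee, meenee, minee, mo",
           "eenee, meenee, minee, mo, oh, no!", "a,b,a,,"]

samples.each do |s|
  puts "#{s.inspect}: count=#{comma_count_ok?(s)} regex=#{s.match?(COMMA_RE)}"
end
```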
I know that this question is over asked, but I couldn't find something that fits in my case and also works with rails. I'm looking for a simple regex for words that can contain:
letters(no digits)
white spaces
. (dot) or - (dash)
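The following regex allows only letters, white space, dots and dashes. The \A and \z anchors and the + quantifier make it match the whole string; without them, any string containing a single allowed character would pass:
/\A[a-z\s.-]+\z/i
Your validation in the model would be:
validates_format_of :first_name, :with => /\A[a-z\s.-]+\z/i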
Just to add to Sharvy Ahmed's answer.
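You can also express the validation in this format in the model.
Define the regex as a constant and then reference it in the first_name validation. The constant must be defined before the line that uses it:
VALID_NAME_REGEX = /\A[a-z\s.-]+\z/i
validates :first_name, format: { with: VALID_NAME_REGEX }
Where:
/ - Marks the start and the end of the regex literal
\A, \z - Anchor the match to the start and the end of the string
[ ] - Enclose the set of allowed characters
a-z - Matches characters in the range 'a' to 'z'
\s - Matches any whitespace character like tabs, spaces
. - Matches a '.' character (inside the brackets it is literal)
- - Matches a '-' character (it is literal because it comes last in the brackets)
+ - Requires one or more of the allowed characters
i - Makes the whole expression case insensitive.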
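To see why the anchors matter, here is a quick comparison in plain Ruby. The constant names are made up for this sketch:

```ruby
UNANCHORED = /[a-z\s.-]/i        # matches if ANY one allowed character is present
ANCHORED   = /\A[a-z\s.-]+\z/i   # matches only if EVERY character is allowed

["Mary-Jo St. John", "R2-D2", "1234"].each do |name|
  puts "#{name.inspect}: unanchored=#{name.match?(UNANCHORED)} anchored=#{name.match?(ANCHORED)}"
end
```

"R2-D2" contains digits but still satisfies the unanchored pattern, because the letters and the dash each match on their own.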
That's all.
I hope this helps
Two questions that I believe are connected and I think are regex related but have me stumped after some fruitless googling.
validates :image_url, format: { with: %r{\.(gif|jpg)\Z}i }
My guesses: similar to ruby/regex i = ignore case, single pipe means 'or'. Guessing \Z means end of string. The brackets are just containers unlike ruby/regex where they signify something wildly different.
But what does the %r do? I haven't run across that in ruby/regex.
ok_urls = %w{ fred.gif fred.jpg FRED.Jpg}
%r and %w seem to be doing the same thing so I'm confused why there are two separate commands to do the same thing. Sorry if this isn't very clear.
A Regexp holds a regular expression, used to match a pattern against strings. Regexps are created using the /.../ and %r{...} literals, and by the Regexp::new constructor.
%r and %w seem to be doing the same thing so I'm confused..
%w{ fred.gif fred.jpg FRED.Jpg}
# => ["fred.gif", "fred.jpg", "FRED.Jpg"]
%r{ a b }
# => / a b /
No. They are not same, as you can see above.
One thing I noticed with %r{} is that you don't need to escape slashes.
# /../ literals:
url.match /http:\/\/example\.com\//
# => #<MatchData "http://example.com/">
# %r{} literals:
url.match %r{http://example\.com/}
# => #<MatchData "http://example.com/">
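Putting the two literals from the question together, a short sketch (variable names other than ok_urls are made up, and the example URL is just a placeholder):

```ruby
ok_urls  = %w{ fred.gif fred.jpg FRED.Jpg }   # %w builds an Array of Strings
image_re = %r{\.(gif|jpg)\Z}i                 # %r builds a Regexp

# Every entry of the word array is tested against the regex.
puts ok_urls.all? { |url| url.match?(image_re) }
puts "fred.gif.exe".match?(image_re)

# The same pattern written with both regex literal styles.
with_slashes = /http:\/\/example\.com\//
with_percent = %r{http://example\.com/}
url = "http://example.com/"
puts url.match?(with_slashes) && url.match?(with_percent)
```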
Use %r only for regular expressions matching more than one '/' character.
# bad
%r(\s+)
# still bad
%r(^/(.*)$)
# should be /^\/(.*)$/
# good
%r(^/blog/2011/(.*)$)