I need to check username format using a regular expression.
My username criteria are:
Must contain 1 or more letters, anywhere.
May contain any amount of numbers, anywhere.
Can contain up to 2 - or _, anywhere.
^[0-9\-_]*[a-z|A-Z]+[0-9\-_]*$ is what I was using but this will reject usernames such as 123hi123hi, or hi123hi. I need something less string location dependent.
I'm using Ruby on Rails to match strings against this.
A very inefficient Ruby function version for Rails is:
validate :check_username
def check_username
  if self.username.count("-") > 2
    errors.add(:username, "cannot contain more than 2 dashes")
  elsif self.username.count("_") > 2
    errors.add(:username, "cannot contain more than 2 underscores")
  elsif self.username.count("a-zA-Z") < 1
    errors.add(:username, "must contain a letter")
  elsif (self.username =~ /^[0-9a-zA-Z\-_]+$/) != 0
    errors.add(:username, "cannot contain special characters")
  end
end
Here are two approaches you could use.
Construct a single regex
Because regular expressions are concerned with the ordering of characters in a string, one would have to construct a regular expression for each of the following combinations and then "or" those regexes into a single regex.
one letter, zero hyphens, zero underscores
one letter, zero hyphens, one underscore
one letter, zero hyphens, two underscores
one letter, one hyphen, zero underscores
one letter, one hyphen, one underscore
one letter, one hyphen, two underscores
one letter, two hyphens, zero underscores
one letter, two hyphens, one underscore
one letter, two hyphens, two underscores
Digits and additional letters could appear anywhere in the username.
Let's call those regular expressions t0, t1,..., t8. The desired single, overall regular expression would be:
/#{t0}|#{t1}|...|#{t8}/
Let's consider the construction of t4 (one letter, one hyphen, one underscore).
Six orders are possible for this combination.
a letter, a hyphen, an underscore
a letter, an underscore, a hyphen
a hyphen, a letter, an underscore
a hyphen, an underscore, a letter
an underscore, a letter, a hyphen
an underscore, a hyphen, a letter
We would need to construct a regular expression for each of these six orders (r0, r1,..., r5) and then "or" them to obtain t4:
t4 = /#{r0}|#{r1}|#{r2}|#{r3}|#{r4}|#{r5}/
Now let's consider the construction of a regex r0 that would implement the first of these orderings (a letter, a hyphen, an underscore):
r0 = /\A[a-z0-9]*[a-z][a-z0-9]*-[a-z0-9]*_[a-z0-9]*\z/i
"3ab4-3cd_e5".match?(r0) #=> true
"3ab4-3cde5".match?(r0) #=> false (no underscore)
"34-3cd_e5".match?(r0) #=> false (no letter before the hyphen)
"3ab4_3cd-e5".match?(r0) #=> false (underscore precedes hyphen)
Construction of each of the other five ri's would be similar.
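To give a feel for how mechanical the construction is, the six orderings for t4 could be generated instead of written by hand. This is only a sketch: FILLER, combination_regex and T4 are names invented here for illustration, and the helper simply enumerates the permutations of the required tokens.

```ruby
# Hypothetical helper (not part of the answer above): builds the regex for one
# combination by enumerating every distinct ordering of its required tokens.
FILLER = '[a-z0-9]*' # digits and additional letters may appear anywhere

def combination_regex(tokens)
  alternatives = tokens.permutation.to_a.uniq.map do |order|
    FILLER + order.join(FILLER) + FILLER
  end
  Regexp.new("\\A(?:#{alternatives.join('|')})\\z", Regexp::IGNORECASE)
end

# t4: one letter, one hyphen, one underscore (six orderings)
T4 = combination_regex(['[a-z]', '-', '_'])

T4.match?("3ab4-3cd_e5") #=> true
T4.match?("3ab4_3cd-e5") #=> true  (underscore before hyphen is covered too)
T4.match?("3ab4-3cde5")  #=> false (no underscore)
```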
We would then need to compute ti for each of the eight combinations other than the fifth one. t0 (one letter, zero hyphens, zero underscores) is easy:
t0 = /\A[a-z0-9]*[a-z][a-z0-9]*\z/i
By contrast, t8 (one letter, two hyphens, two underscores) would be a much longer regex than t4 (considered above), as a regular expression would have to be hand-crafted for each of 5!/(2!*2!) #=> 30 orderings (r0, r1,..., r29).
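The ordering counts can be checked by brute force. A small sketch, where distinct_orderings is a made-up helper name and "L" stands in for the required letter:

```ruby
# Count the distinct orderings of a multiset of tokens by enumerating
# all permutations and discarding duplicates.
def distinct_orderings(tokens)
  tokens.permutation.to_a.uniq
end

distinct_orderings(%w[L - _]).size     #=> 6  (t4)
distinct_orderings(%w[L - - _ _]).size #=> 30 (t8)
```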
It should now be obvious that the use of a single regular expression is simply not the right tool for validating usernames.
Do not construct a single regex
def username_valid?(username)
  cnt = username.each_char.with_object(Hash.new(0)) do |c,cnt|
    case c
    when /\d/
    when /[[:alpha:]]/
      cnt[:letter] += 1
    when '-'
      cnt[:hyphen] += 1
    when '_'
      cnt[:underscore] += 1
    else
      return false
    end
  end
  cnt.fetch(:letter, 0) > 0 && cnt.fetch(:hyphen, 0) <= 2 &&
    cnt.fetch(:underscore, 0) <= 2
end
username_valid? "Bob" #=> true
username_valid? "Bob1_23_-" #=> true
username_valid? "z" #=> true
username_valid? "123--_" #=> false (no letters)
username_valid? "Melba1-23--_" #=> false (3 hyphens)
username_valid? "Bob1_23_-$" #=> false ($ not permitted)
A hash created by Hash::new with an argument (the default value) of zero is often called a counting hash. If h is such a hash and has no key k, h[k] returns the default value without storing anything. The increment is evaluated as follows:
h[k] += 1
#=> h[k] = h[k] + 1
#=> h[k] = 0 + 1
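A minimal sketch of the same counting-hash idiom in isolation. char_counts is a hypothetical helper name used only for this illustration:

```ruby
# Tally every character of a string using a hash whose default value is 0.
def char_counts(str)
  str.each_char.with_object(Hash.new(0)) { |c, counts| counts[c] += 1 }
end

char_counts("a-b-a") #=> {"a"=>2, "-"=>2, "b"=>1}

h = Hash.new(0)
h[:missing]      #=> 0 (the default is returned; nothing is stored)
h.key?(:missing) #=> false
```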
The method could instead be written to return false as soon as it determines that the username is invalid.
def username_valid?(username)
  cnt = username.each_char.with_object(Hash.new(0)) do |c,cnt|
    case c
    when /\d/
    when /[[:alpha:]]/
      cnt[:letter] += 1
    when '-'
      return false if cnt[:hyphen] == 2
      cnt[:hyphen] += 1
    when '_'
      return false if cnt[:underscore] == 2
      cnt[:underscore] += 1
    else
      return false
    end
  end
  cnt.fetch(:letter, 0) > 0
end
This is a bad use for a regular expression because your data isn't structured enough. Instead, a small series of simple tests will tell you what you need to know:
def valid?(str)
str[/[a-z]/i] && str.tr('^-_', '').size <= 2
end
%w(123hi123hi hi123hi).each do |username|
username # => "123hi123hi", "hi123hi"
valid?(username) # => true, true
end
There is a loss of speed due to the use of the regular expression
/[a-z]/i
so instead
def valid?(str)
  str.downcase.tr('^a-z', '').size > 0 && str.tr('^-_', '').size <= 2
end
could be used. The use of the regular expression is about 45% slower based on testing.
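For anyone who wants to repeat the comparison, here is a rough benchmark sketch using Ruby's standard Benchmark module. The method names, sample strings and iteration count are assumptions made for this example, and the tr variant requires at least one letter (size > 0). Timings depend on the Ruby version and the machine, so no figures are shown.

```ruby
require 'benchmark'

# Hypothetical names for the two variants being compared.
def valid_with_regex?(str)
  !!str[/[a-z]/i] && str.tr('^-_', '').size <= 2
end

def valid_with_tr?(str)
  str.downcase.tr('^a-z', '').size > 0 && str.tr('^-_', '').size <= 2
end

SAMPLES = %w[123hi123hi hi123hi hi-123_hi 12345 a-b-c-d].freeze
ITERATIONS = 50_000

Benchmark.bm(6) do |x|
  x.report('regex') { ITERATIONS.times { SAMPLES.each { |s| valid_with_regex?(s) } } }
  x.report('tr')    { ITERATIONS.times { SAMPLES.each { |s| valid_with_tr?(s) } } }
end
```

Both variants should agree on every input, so the choice between them is purely a matter of speed and readability.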
Breaking it down:
str[/[a-z]/i] tests for a minimum of one letter. Since there can be more than one, this will suffice.
str.downcase.tr('^a-z', '').size converts the characters to lowercase, then strips all non-letter characters, resulting in only letters remaining, then counts how many there are:
'123hi123hi'.downcase # => "123hi123hi"
.tr('^a-z', '') # => "hihi"
.size # => 4
'hi123hi'.downcase # => "hi123hi"
.tr('^a-z', '') # => "hihi"
.size # => 4
'hi-123_hi'.downcase # => "hi-123_hi"
.tr('^a-z', '') # => "hihi"
.size # => 4
The rule
May contain any amount of numbers, anywhere
isn't worth testing so I ignored it.
This is an improved version of your regex:
^[\w-]*[A-Za-z]+[\w-]*$
But this cannot count how many - or _ characters there are, so you will need another regex for that, or count them in code.
This regex checks for two or fewer [-_] characters, regardless of position:
^[A-Za-z\d]*[-_]{0,1}[A-Za-z\d]*[-_]{0,1}[A-Za-z\d]*$
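A sketch of how the two patterns might be combined in Ruby. The constant and method names are made up for this example, and ^/$ are swapped for \A/\z so that a multi-line string cannot slip through. Note that this reads the rule as "at most two - or _ in total".

```ruby
# Both patterns must match: the first requires a letter and only allowed
# characters, the second allows at most two separators (- or _) overall.
HAS_LETTER             = /\A[\w-]*[A-Za-z]+[\w-]*\z/
AT_MOST_TWO_SEPARATORS = /\A[A-Za-z\d]*[-_]?[A-Za-z\d]*[-_]?[A-Za-z\d]*\z/

def username_ok?(name)
  name.match?(HAS_LETTER) && name.match?(AT_MOST_TWO_SEPARATORS)
end

username_ok?("123hi123hi") #=> true
username_ok?("hi-1_2")     #=> true
username_ok?("a-b-c-d")    #=> false (three separators)
username_ok?("123")        #=> false (no letter)
```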
If you're allowing only letters, numbers, dashes and underscores,
and everything else is considered a special character,
I think it's only that the pattern you have needs negation.
Instead of (self.username =~ /^[0-9a-zA-Z\-_]+$/) !=0
try self.username =~ /[^0-9a-zA-Z\-_]/
or self.username =~ /[^\w-]/.
Either one returns an index (which is truthy) as soon as a single disallowed character is found, and nil otherwise.
Also, why not do a count for special characters, like in the conditions above this one?
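Counting the special characters could look something like this. It is a sketch only: special_char_count is a hypothetical helper, and it relies on String#count accepting a negated character set.

```ruby
# Count characters that are NOT letters, digits, underscores or hyphens.
# In String#count a leading ^ negates the set and \- is a literal hyphen.
def special_char_count(username)
  username.count('^a-zA-Z0-9_\-')
end

special_char_count("Bob1_23_-")  #=> 0
special_char_count("Bob1_23_-$") #=> 1
special_char_count("a b!")       #=> 2
```

In the validation, an error would then be added whenever the count is greater than zero.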
I have a piece of code that replaces words in @post.swap_content with hyperlinks, by keyword. For example, if @post.swap_content contains the word 'michigan' and keywords contains the keyword 'Michigan', the word is replaced with the hyperlink attached to that keyword. Here is part of the function:
def execute
  all_keys = Keyword.all.pluck(:key, :link).to_h.transform_keys(&:downcase)
  @post.swap_content = @post.swap_content.to_s.gsub(/\w+/) do |word|
    url = all_keys[word.downcase]
    url ? "<a href='#{url}'>#{word}</a>" : word
  end
  @post.save!
end
My question is: how can I make gsub replace only the first two occurrences of each keyword in @post.swap_content? For example, if @post.swap_content is 'michigan, michigan and michigan, utah and utah', how can I turn only the first two of each keyword into hyperlinks (the first two 'michigan' and the first two 'utah')? I think I need to work with gsub somehow, but I don't know how to limit the number of words it replaces.
You can provide a block to gsub that will be invoked with each match. You could use this to count occurrences and conditionally replace content.
str = "Dog dog dog cat cat cat"
occurences = {}
str.gsub(/\w+/) do |match|
  # downcase so Dog and dog are counted together
  key = match.downcase
  # build a hash which counts the number of times we've matched a word.
  count = occurences.store(key, occurences.fetch(key, 0).next)
  # return the word unchanged or wrap in a hyperlink depending on count
  count > 2 ? match : "<a>#{match}</a>"
end
# output => "<a>Dog</a> <a>dog</a> dog <a>cat</a> <a>cat</a> cat"
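Applied to the keyword case from the question, the same idea might look like the sketch below. link_first_two is a hypothetical method, and the links hash stands in for the Keyword lookup (the URLs are placeholders). Only words that actually have a link are counted.

```ruby
# Wrap at most the first two occurrences of each keyword in a hyperlink.
# `links` maps downcased keywords to URLs (placeholder values below).
def link_first_two(text, links)
  seen = Hash.new(0)
  text.gsub(/\w+/) do |word|
    key = word.downcase
    url = links[key]
    next word unless url # not a keyword: leave unchanged

    seen[key] += 1
    seen[key] <= 2 ? "<a href='#{url}'>#{word}</a>" : word
  end
end

links = { 'michigan' => '/mi', 'utah' => '/ut' }
link_first_two('michigan, michigan and michigan, utah and utah', links)
#=> "<a href='/mi'>michigan</a>, <a href='/mi'>michigan</a> and michigan, <a href='/ut'>utah</a> and <a href='/ut'>utah</a>"
```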
Suppose:
str = "Dog dog cat dog cat Dog cat cat"
If Ruby's regex engine supported variable-length negative lookbehinds we could write:
R = /\b(\w+)\b(?<!(?:\b\1\b.*){2})/i
str.gsub(R, '<a>\1</a>')
#=> "<a>Dog</a> <a>dog</a> <a>cat</a> dog <a>cat</a> Dog cat cat"
We can write this regular expression in free-spacing mode to make it self-documenting:
R = /
  \b          # assert a word break
  (\w+)       # match 1+ word characters and save to capture group 1
  \b          # assert a word break
  (?<!        # begin a negative lookbehind
    (?:       # begin a non-capture group
      \b      # assert a word break
      \1      # match the content of capture group 1
      \b      # assert a word break
      .*      # match 0+ characters
    )         # end non-capture group
    {2}       # execute non-capture group twice
  )           # end negative lookbehind
/ix           # assert case-independent and free-spacing regex def modes
Unfortunately, Ruby's regex engine does not support variable-length (positive or negative) lookbehinds (though one day it might). It does, however, support variable-length (positive and negative) lookaheads. We therefore could reverse the string, perform the desired replacements using gsub then reverse the resulting string, as follows:
R = /\b(\w+)\b(?!(?:.*\b\1\b){2})/i
str.reverse.gsub(R, '>a/<\1>a<').reverse
#=> "<a>Dog</a> <a>dog</a> <a>cat</a> dog <a>cat</a> Dog cat cat"
The steps are as follows.
s = str.reverse
#=> "tac tac goD tac god tac god goD"
t = s.gsub(R, '>a/<\1>a<')
#=> "tac tac goD >a/<tac>a< god >a/<tac>a< >a/<god>a< >a/<goD>a<"
t.reverse
#=> "<a>Dog</a> <a>dog</a> <a>cat</a> dog <a>cat</a> Dog cat cat"
Let's have a closer look at the regular expression.
R = /
  \b          # assert a word break
  (\w+)       # match 1+ word characters and save to capture group 1
  \b          # assert a word break
  (?!         # begin a negative lookahead
    (?:       # begin a non-capture group
      .*      # match 0+ characters
      \b      # assert a word break
      \1      # match the content of capture group 1
      \b      # assert a word break
    )         # end non-capture group
    {2}       # execute non-capture group twice
  )           # end negative lookahead
/ix           # assert case-independent and free-spacing regex def modes
I have a query string which I want to separate out
created_at BETWEEN '2018-01-01T00:00:00+05:30' AND '2019-01-01T00:00:00+05:30' AND updated_at BETWEEN '2018-05-01T00:00:00+05:30' AND '2019-05-01T00:00:00+05:30' AND user_id = 5 AND status = 'closed'
Like this
created_at BETWEEN '2018-01-01T00:00:00+05:30' AND '2019-01-01T00:00:00+05:30'
updated_at BETWEEN '2018-05-01T00:00:00+05:30' AND '2019-05-01T00:00:00+05:30'
user_id = 5
status = 'closed'
This is just an example string; I want to separate the query string dynamically. I know I can't just split on AND because of patterns like BETWEEN ... AND.
You might be able to do this with regex but here's a parser that may work for your use case. It can surely be improved but it should work.
require 'time'
def parse(sql)
  arr = []
  split = sql.split(' ')
  date_counter = 0
  split.each_with_index do |s, i|
    date_counter = 2 if s == 'BETWEEN'
    time = Time.parse(s.strip) rescue nil
    date_counter -= 1 if time
    arr << i+1 if date_counter == 1
  end
  arr.select(&:even?).each do |index|
    split.insert(index + 2, 'SPLIT_ME')
  end
  split = split.join(' ').split('SPLIT_ME').map{|l| l.strip.gsub(/(AND)$/, '')}
  split.map do |line|
    line[/^AND/] ? line.split('AND') : line
  end.flatten.select{|l| !l.empty?}.map(&:strip)
end
This is not really a regex, but more a simple parser.
It works as follows:
1. Match from the start of the string up to (but not including) a whitespace character followed by either and or between followed by another whitespace character. The result is removed from where_cause and saved in statement.
2. If the remaining string now starts with whitespace, between, whitespace, that part is appended to statement and removed from where_cause, together with everything after it, allowing one and. Matching stops when the end of the string is reached or another and is encountered.
3. If point 2 didn't match, check whether the string starts with whitespace, and, whitespace. If so, remove that from where_cause.
4. Finally, add statement to the statements array if it isn't an empty string.
All matching is case-insensitive.
where_cause = "created_at BETWEEN '2018-01-01T00:00:00+05:30' AND '2019-01-01T00:00:00+05:30' AND updated_at BETWEEN '2018-05-01T00:00:00+05:30' AND '2019-05-01T00:00:00+05:30' AND user_id = 5 AND status = 'closed'"
statements = []
until where_cause.empty?
  statement = where_cause.slice!(/\A.*?(?=[\s](and|between)[\s]|\z)/mi)
  if where_cause.match?(/\A[\s]between[\s]/i)
    between = /\A[\s]between[\s].*?[\s]and[\s].*?(?=[\s]and[\s]|\z)/mi
    statement << where_cause.slice!(between)
  elsif where_cause.match?(/\A[\s]and[\s]/i)
    where_cause.slice!(/\A[\s]and[\s]/i)
  end
  statements << statement unless statement.empty?
end
pp statements
# ["created_at BETWEEN '2018-01-01T00:00:00+05:30' AND '2019-01-01T00:00:00+05:30'",
# "updated_at BETWEEN '2018-05-01T00:00:00+05:30' AND '2019-05-01T00:00:00+05:30'",
# "user_id = 5",
# "status = 'closed'"]
Note: Ruby uses \A to match the start of the string and \z to match the end of a string instead of the usual ^ and $, which match the beginning and ending of a line respectively. See the regexp anchor documentation.
You can replace every [\s] with \s if you like. I've added them in to make the regex more readable.
Keep in mind that this solution isn't perfect, but might give you an idea how to solve the issue. The reason I say this is because it doesn't account for the words and/between in column name or string context.
The following where clause:
where_cause = "name = 'Tarzan AND Jane'"
Will output:
#=> ["name = 'Tarzan", "Jane'"]
This solution also assumes correctly structured SQL queries. The following queries don't result in what you might think:
where_cause = "created_at = BETWEEN AND"
# TypeError: no implicit conversion of nil into String
# ^ does match /\A[\s]between[\s]/i, but not the #slice! argument
where_cause = "id = BETWEEN 1 AND 2 BETWEEN 1 AND 3"
#=> ["id = BETWEEN 1 AND 2 BETWEEN 1", "3"]
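One way around the quoted-string limitation is to tokenize first, keeping each single-quoted literal as one token, and only then decide which AND separates conditions. The following is a rough sketch under stated assumptions, not a SQL parser: split_conditions is a made-up name, tokens are assumed to be whitespace-separated, and nesting, parentheses and OR are not handled.

```ruby
# Split a WHERE clause on top-level ANDs. An AND that directly follows a
# BETWEEN operand is kept as part of that condition.
def split_conditions(where_clause)
  # A token is either a single-quoted literal ('' is an escaped quote)
  # or a run of non-whitespace characters.
  tokens = where_clause.scan(/'(?:[^']|'')*'|\S+/)
  conditions = [[]]
  pending_between = false

  tokens.each do |token|
    case token.upcase
    when 'BETWEEN'
      pending_between = true
      conditions.last << token
    when 'AND'
      if pending_between
        pending_between = false # this AND belongs to the BETWEEN
        conditions.last << token
      else
        conditions << []        # this AND separates two conditions
      end
    else
      conditions.last << token
    end
  end

  conditions.reject(&:empty?).map { |parts| parts.join(' ') }
end

split_conditions("name = 'Tarzan AND Jane' AND id BETWEEN 1 AND 3")
#=> ["name = 'Tarzan AND Jane'", "id BETWEEN 1 AND 3"]
```

Because tokens are rejoined with single spaces, runs of whitespace outside quoted literals are normalized.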
I'm not certain if I understand the question, particularly in view of the previous answers, but if you simply wish to extract the indicated substrings from your string, and all column names begin with lowercase letters, you could write the following (where str holds the string given in the question):
str.split(/ +AND +(?=[a-z])/)
#=> ["created_at BETWEEN '2018-01-01T00:00:00+05:30' AND '2019-01-01T00:00:00+05:30'",
# "updated_at BETWEEN '2018-05-01T00:00:00+05:30' AND '2019-05-01T00:00:00+05:30'",
# "user_id = 5",
# "status = 'closed'"]
The regular expression reads, "match one or more spaces, followed by 'AND', followed by one or more spaces, followed by a positive lookahead that contains a lowercase letter". Being in a positive lookahead, the lowercase letter is not part of the match that is returned.
I'm trying to find the best way to determine the letter count in an array of strings. I'm splitting the string, and then looping every word, then splitting letters and looping those letters.
When I get to the point where I determine the length, the problem I have is that it's counting commas and periods too. Thus, the length in terms of letters only is inaccurate.
I know this may be a lot shorter with regex, but I'm not well versed on that yet. My code is passing most tests, but I'm stuck where it counts commas.
E.g. for 'You,' the letter count should be 3.
Sample code:
def abbr(str)
  new_words = []
  str.split.each do |word|
    new_word = []
    word.split("-").each do |w| # it has to be able to handle hyphenated words as well
      letters = w.split('')
      if letters.length >= 4
        first_letter = letters.shift
        last_letter = letters.pop
        new_word << "#{first_letter}#{letters.count}#{last_letter}"
      else
        new_word << w
      end
    end
    new_words << new_word.join('-')
  end
  new_words.join(' ')
end
I tried doing gsub before looping the words, but that wouldn't work because I don't want to completely remove the commas. I just don't need them to be counted.
Any enlightenment is appreciated.
arr = ["Now is the time for y'all Rubiests",
"to come to the aid of your bowling team."]
arr.join.size
#=> 74
Without a regex
def abbr(arr)
str = arr.join
str.size - str.delete([*('a'..'z')].join + [*('A'..'Z')].join).size
end
abbr arr
#=> 58
Here and below, arr.join converts the array to a single string.
With a regex
R = /
[^a-z] # match a character that is not a lower-case letter
/ix # case-insensitive (i) and free-spacing regex definition (x) modes
def abbr(arr)
arr.join.gsub(R,'').size
end
abbr arr
#=> 58
You could of course write:
arr.join.gsub(/[^a-z]/i,'').size
#=> 58
Try this:
def abbr(str)
  str.gsub(/\b\w+\b/) do |word|
    if word.length >= 4
      "#{word[0]}#{word.length - 2}#{word[-1]}"
    else
      word
    end
  end
end
The regex in the gsub call says "one or more word characters preceded and followed by a word boundary". The block passed to gsub operates on each word, the return from the block is the replacement for the 'word' match in gsub.
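A variant of the same gsub-with-block idea that counts letters only, so digits and underscores (which \w also matches) are left alone. abbreviate is a hypothetical name chosen here so it is not confused with the method above.

```ruby
# Abbreviate every run of 4+ letters as first letter, count of inner
# letters, last letter. Punctuation and hyphens are never part of a match.
def abbreviate(str)
  str.gsub(/[[:alpha:]]+/) do |word|
    word.length >= 4 ? "#{word[0]}#{word.length - 2}#{word[-1]}" : word
  end
end

abbreviate("You, internationalization is elephant-ride.")
#=> "You, i18n is e6t-r2e."
```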
You can check, for each character, whether its ASCII value lies in 97-122 or 65-90. Whenever this condition is fulfilled, increment a local variable; at the end it holds the length of the string without any digits, special characters or whitespace.
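A minimal sketch of that ASCII-range check, assuming the input is plain ASCII text (letter_count is a name invented for this example):

```ruby
# Count bytes that fall in the ASCII ranges for A-Z (65-90) or a-z (97-122).
def letter_count(str)
  str.each_byte.count { |b| (65..90).cover?(b) || (97..122).cover?(b) }
end

letter_count("You,")          #=> 3
letter_count("bowling team.") #=> 11
```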
You can use something like this (short version):
a.map { |x| x.chars.reject { |char| [' ', ',', '.'].include? char }.size }
Long version with explanation:
a = ['a, ', 'bbb', 'c c, .'] # Initial array of strings
excluded_chars = [' ', ',', '.'] # Chars you don't want to be counted
a.map do |str| # Iterate through all strings in array
str.chars.reject do |char| # Split each string to the array of chars
excluded_chars.include? char # and reject excluded_chars from it
end.size # This returns [["a"], ["b", "b", "b"], ["c", "c"]]
end # so, we take #size of each array to get size of the string
# Result: [1, 3, 2]
I need to verify that a string has at least one comma but not more than 4 commas.
This is what I've tried:
/,{1,4}/
/,\s{1,4}/
Neither of those works.
Note: I've been testing my RegEx's on Rubular
Any help is greatly appreciated.
Note: I'm using this in the context of an Active Record Validation:
validates :my_string, format: { with: /,\s{1,4}/}
How can I do this as an Active Record Validation?
Does it have to be a regex? If not, use Ruby's count method:
> "a,a,a,a,a".count(',')
=> 4
str ="a,b,a,,"
p str.count(",").between?(1, 4) # => true
I too would suggest using count, but to address your specific question, you could do it thusly:
r = /^(?:[^,]*,){1,4}[^,]*$/
!!"eenee"[r]
#=> false
!!"eenee, meenee"[r]
#=> true
!!"eenee, meenee, minee, mo"[r]
#=> true
!!"eenee, meenee, minee, mo, oh, no!"[r]
#=> false
(?:[^,]*,) is a non-capture group that matches any string of characters other than a comma, followed by a comma;
{1,4} ensures that the non-capture group is matched between 1 and 4 times;
the anchor ^ ensures there is no comma before the first non-capture group; and
[^,]*$ ensures there is no comma after the last non-capture group.
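As a quick sanity check, the regex and count approaches can be compared side by side. In this sketch the constant and method names are made up, and the pattern uses \A and \z instead of ^ and $ so that the whole string is tested rather than any single line of it.

```ruby
COMMAS_1_TO_4 = /\A(?:[^,]*,){1,4}[^,]*\z/

def commas_ok_by_regex?(str)
  str.match?(COMMAS_1_TO_4)
end

def commas_ok_by_count?(str)
  str.count(',').between?(1, 4)
end

["eenee", "eenee, meenee", "a,b,a,,", "a,a,a,a,a,a"].each do |s|
  puts "#{s.inspect}: regex=#{commas_ok_by_regex?(s)} count=#{commas_ok_by_count?(s)}"
end
```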
unless (place =~ /^\./) == 0
I know that unless is like if not, but what about the conditional?
=~ means matches regex
/^\./ is a regular expression:
/.../ are the delimiters for the regex
^ matches the start of the string or of a line (\A matches the start of the string only)
\. matches a literal .
It checks whether the string place starts with a period (.).
Consider this:
p ('.foo' =~ /^\./) == 0 # => true
p ('foo' =~ /^\./) == 0 # => false
In this case, it wouldn't be necessary to use == 0. place =~ /^\./ would suffice as a condition:
p '.foo' =~ /^\./ # => 0 # 0 evaluates to true in Ruby conditions
p 'foo' =~ /^\./ # => nil
EDIT: /^\./ is a regular expression. The start and end slashes denote that it is a regular expression, leaving the important bit, ^\.. The first character, ^, marks "start of string/line" and \. is the literal character ., as the dot character is normally considered a special character in regular expressions.
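The difference between ^ and \A only shows up with multi-line strings. A small sketch (the constant names are invented for this example):

```ruby
LINE_START   = /^\./  # ^ matches at the start of any line
STRING_START = /\A\./ # \A matches only at the very start of the string

multi_line = "foo\n.bar"

multi_line =~ LINE_START    #=> 4   (the dot that begins the second line)
multi_line =~ STRING_START  #=> nil
multi_line.start_with?('.') #=> false
'.foo'.start_with?('.')     #=> true
```

If the intent is simply "does the string begin with a dot", String#start_with? says so directly and needs no regex.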
To read more about regular expressions, see Wikipedia or the excellent regular-expressions.info website.