Define a regex, which matches one digit twice and all others once - ruby-on-rails

As part of a larger regex I would like to match the following restrictions:
The string has 11 digits
All digits are numbers
Within the first 10 digits one number [0-9] (and one only!) must be listed twice
This means the following should match:
12345678914
12235879600
Whereas these should not:
12345678903 -> none of the numbers at digits 1 to 10 appears twice
14427823482 -> one number appears more than twice
72349121762 -> two numbers appear twice
I have tried to use a lookahead, but all I'm managing is that the regex counts a certain digit, i.e.:
(?!.*0\1{2})
That does not do what I need. Is my query even possible with regex?

You can use this kind of pattern:
\A(?=\d{11}\z)(?:(\d)(?!\d*\1\d))*(\d)(?=\d*\2\d)(?:(\d)(?!\d*\3\d))+\d\z
pattern details:
The idea is to describe the string as one duplicate digit surrounded by non-duplicate digits.
Finding a duplicate digit is easy with a capture group, a lookahead assertion and a backreference: (\d)(?=\d*\1)
You can use the same pattern to ensure that a digit has no duplicate, but this time with a negative lookahead: (\d)(?!\d*\1)
To leave the last digit (digit no. 11) out of the search for duplicates, you only need to add a digit after the backreference: (\d)(?=\d*\1\d). This ensures there is at least one digit between the backreference and the end of the string.
Note that in the present context, a duplicate digit is a digit that is followed, immediately or later, by the same digit (i.e. in 1234567891 the first 1 is a duplicate digit, but the last 1 is not, because it is not followed by another 1).
\A # beginning of the string
(?=\d{11}\z) # check the string length (if not needed, remove it)
(?:(\d)(?!\d*\1\d))* # zero or more non duplicate digits
(\d)(?=\d*\2\d) # one duplicate digit
(?:(\d)(?!\d*\3\d))+ # one or more non duplicate digits
\d # the ignored last digit
\z # end of the string
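As a quick sanity check, the commented pattern above can be dropped into Ruby's extended (/x) mode and run against the five strings from the question. This is only a sketch: the constant name ONE_DUPLICATE is made up for illustration, and the pattern is copied from the answer unchanged.

```ruby
# Hypothetical constant name; the pattern is the one described above,
# written in extended mode so that whitespace and comments are ignored.
ONE_DUPLICATE = /
  \A
  (?=\d{11}\z)            # the string is exactly 11 digits long
  (?:(\d)(?!\d*\1\d))*    # zero or more non-duplicate digits
  (\d)(?=\d*\2\d)         # the one duplicate digit
  (?:(\d)(?!\d*\3\d))+    # one or more non-duplicate digits
  \d                      # the ignored last digit
  \z
/x

# The two strings that should match, then the three that should not.
%w[12345678914 12235879600 12345678903 14427823482 72349121762].each do |s|
  puts "#{s}: #{ONE_DUPLICATE.match?(s)}"
end
```

The first two strings should print true and the last three false, which is what the question asks for.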
Another way
This time you check the duplicates at the beginning of the pattern with lookaheads: one lookahead to ensure there is a duplicate digit, and one negative lookahead to ensure there are not two duplicate digits:
\A(?=\d*(\d)(?=\d*\1\d))(?!\d*(\d)(?=\d*\2\d)\d*(\d)(?=\d*\3\d))\d{11}\z
pattern details:
\A
(?= # check if there is one duplicate digit
\d*(\d)(?=\d*\1\d)
)
(?! # check if there are not two duplicate digits
\d*(\d)(?=\d*\2\d) # the first
\d*(\d)(?=\d*\3\d) # the second
)
\d{11}
\z
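A small cross-check of this lookahead variant, as a sketch only: it compares the regex with a plain array test on the first ten digits (exactly nine distinct digits means one digit occurs twice). The names TWO_LOOKAHEADS and array_check are invented for this example.

```ruby
# Hypothetical names; the pattern is the lookahead variant shown above.
TWO_LOOKAHEADS = /\A(?=\d*(\d)(?=\d*\1\d))(?!\d*(\d)(?=\d*\2\d)\d*(\d)(?=\d*\3\d))\d{11}\z/

# Reference check without lookaheads: 11 digits, and the first 10
# contain exactly 9 distinct digits (one digit appears twice).
def array_check(str)
  str.match?(/\A\d{11}\z/) && str[0, 10].chars.uniq.size == 9
end

%w[12345678914 12235879600 12345678903 14427823482 72349121762].each do |s|
  puts "#{s}: regex=#{TWO_LOOKAHEADS.match?(s)} array=#{array_check(s)}"
end
```

On the sample strings from the question the two columns should agree on every line.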
Note: the first way seems to be more efficient.
The code way
You can easily check whether your string fits the requirements with array methods:
> mydigs = "12345678913"
=> "12345678913"
> puts (mydigs.split(//).take 10).uniq.size == 9
true
=> nil

Related

Regular wrong regular expression, not validating

I want to validate user input in the following format: 3 uppercase characters, 3 digits, an optional space, a -, an optional space, then either 'LAB' or (EN or ENLH followed by 1 digit in the range [1-9]).
The regex I wrote is
/\D{3}\d{3}\s?-\s?(LAB|(EN(LH)?\d{1}))/
I am finding it difficult to stop the input after LAB, so that EEE333 - LAB1 is rejected as invalid.
If you are asking how to prevent LAB1 at the end, use an end of line anchor $ in your regex test:
/\D{3}\d{3}\s?-\s?(LAB|(EN(LH)?\d{1}))$/
If you are trying to require exactly one digit at the end of the acceptable strings, move the single digit match outside of the optional groups:
/\D{3}\d{3}\s?-\s?(LAB|(EN(LH)?))\d{1}$/
I have written the following regular expression for you:
[A-Z]{3}[0-9]{3}\s?-\s?(?:LAB|(?:EN|LH))[1-9]{1}
The regex works as follows:
[A-Z]{3}
MATCH EXACTLY THREE UPPERCASE CHARACTERS RANGING FROM A TO Z
[0-9]{3}
MATCH EXACTLY THREE NUMBERS RANGING FROM 0 TO 9
\s?\-\s?
MATCH AN OPTIONAL SPACE, THEN A REQUIRED '-', THEN AN OPTIONAL SPACE
(?:LAB|(?:EN|LH))
MATCH 'LAB' OR ('EN' OR 'LH'); THE ?: MAKES THE GROUPS NON-CAPTURING
[1-9]{1}
MATCH EXACTLY ONE DIGIT RANGING FROM 1 TO 9
You could place your regex between word boundaries \b.
You start your regex with \D which is any character that is not a digit. That would for example also match $%^. You could use [A-Z].
You use \d{1}, where \d is shorthand for [0-9], but you want to match a digit between 1 and 9: [1-9]. You could also omit the {1}.
Maybe this updated version will work for you?
\b[A-Z]{3}\d{3} ?- ?(?:LAB|(?:EN(?:LH)?[1-9]))\b
Explanation
A word boundary \b
Match 3 uppercase characters [A-Z]{3}
Match 3 digits \d{3}
Match an optional whitespace, a hyphen and another optional whitespace ?- ?
A non-capturing group which matches LAB, or EN or ENLH followed by a digit from 1 to 9 (for example EN1 or ENLH9) (?:LAB|(?:EN(?:LH)?[1-9]))
A word boundary \b
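To see how the pattern behaves on concrete input, here is a minimal Ruby sketch. Two things in it are assumptions rather than part of the answer: the constant name CODE_FORMAT, and the use of \A and \z instead of \b so that the whole string has to fit the format.

```ruby
# Hypothetical constant name. \A and \z are an assumption: they anchor the
# pattern to the whole string instead of using word boundaries.
CODE_FORMAT = /\A[A-Z]{3}\d{3} ?- ?(?:LAB|EN(?:LH)?[1-9])\z/

inputs = [
  "EEE333 - LAB",   # accepted
  "EEE333-EN1",     # accepted, spaces are optional
  "EEE333 - ENLH9", # accepted
  "EEE333 - LAB1",  # rejected, nothing may follow LAB
  "EEE333 - EN0",   # rejected, the digit must be 1 to 9
  "eee333 - LAB"    # rejected, letters must be uppercase
]

inputs.each { |input| puts "#{input.inspect}: #{CODE_FORMAT.match?(input)}" }
```

In Ruby, $ only anchors at the end of a line, so \z is the safer choice when the value may contain a newline.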

Combine these regex expressions

I have two regular expressions: ^(\\p{L}|[0-9]|_)+$ and #[^[:punct:][:space:]]+ (the first is used in Java, the second on iOS). I want to combine these into one expression, to match either one or the other in iOS.
The first one is for a username, so I also need to add a @ character to the start of that one. What would that look like?
The ^(\\p{L}|[0-9]|_)+$ pattern in Java matches the same way as in ICU library used in iOS (they are very similar): a whole string consisting of 1 or more Unicode letters, ASCII digits or _. It is poorly written as the alternation group is quantified and that is much less efficient than a character class based solution, ^[\\p{L}0-9_]+$.
The #[^[:punct:][:space:]]+ pattern matches a # followed with 1 or more chars other than punctuation/symbols and whitespace chars (that is, 1 or more letters or digits, or alphanumeric chars).
What you seek can be written as
@[\\p{L}0-9_]+|#[^[:punct:][:space:]]+
or
@[\\p{L}0-9_]+|#[[:alnum:]]+
or if you want to limit to ASCII digits and not match Unicode digits:
@[\\p{L}0-9_]+|#[\\p{L}0-9]+
It matches
@ - a @ symbol
[\\p{L}0-9_]+ - 1 or more Unicode letters, ASCII digits, _
| - or
# - a # char
[[:alnum:]]+ - 1 or more letters or digits.
[^[:punct:][:space:]]+ - any 1+ chars other than punctuation/symbols and whitespace.
Basically, all these expressions match strings like this.
If you want to match #SomeThing_123 in full, just use [@#]\\w+, a @ or # and then 1 or more letters, digits or _, or to only allow ASCII digits, [@#][\\p{L}0-9_]+.
A word boundary may be required at the end of the pattern, [@#][\\p{L}0-9_]+\\b.
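The alternation can be tried out quickly in Ruby, with the caveat that Ruby's regex engine is not the ICU engine used on iOS, so this is an illustration rather than a drop-in test. It assumes that usernames are marked with @ and tags with #; the constant name and the sample text are invented. A Ruby regex literal needs only a single backslash, unlike the Java string form above.

```ruby
# Assumption: "@" introduces a username and "#" introduces a tag.
# Hypothetical constant name; single backslashes because this is a
# Ruby regex literal, not a Java string.
MENTION_OR_TAG = /@[\p{L}0-9_]+|#[[:alnum:]]+/

text = 'ping @jane_doe about #release2, thanks'

# scan returns every non-overlapping match of either alternative.
p text.scan(MENTION_OR_TAG)
```

The underscore is allowed only in the username alternative, so a tag such as #some_tag would stop at the underscore with this variant.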

What does this pattern ^[%w-.]+$ mean in Lua?

Just came across this pattern, which I really don't understand:
^[%w-.]+$
And could you give me some examples to match this expression?
Valid in Lua, where %w is (almost) the equivalent of \w in other languages
^[%w-.]+$ means match a string that is entirely composed of alphanumeric characters (letters and digits), dashes or dots.
Explanation
The ^ anchor asserts that we are at the beginning of the string
The character class [%w-.] matches one character that is a letter or digit (the meaning of %w), or a dash, or a period. This would be the equivalent of [\w-.] in JavaScript
The + quantifier matches such a character one or more times
The $ anchor asserts that we are at the end of the string
Reference
Lua Patterns
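Since the question asks for examples, here is a rough Ruby equivalent with a few sample strings. It is a sketch under one assumption: %w is approximated by ASCII letters and digits, which is what it covers in Lua's default "C" locale. The constant name LUA_LIKE is made up.

```ruby
# Assumption: %w is approximated by [A-Za-z0-9] (Lua's default "C" locale).
# Hypothetical constant name; \A and \z play the role of ^ and $ in the Lua pattern.
LUA_LIKE = /\A[A-Za-z0-9.-]+\z/

samples = ["example.com", "my-file.v2", "abc", "under_score", "two words", ""]

samples.each { |s| puts "#{s.inspect}: #{LUA_LIKE.match?(s)}" }
```

The first three samples match. The others fail because of the underscore, the space, and the empty string (+ requires at least one character).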
Actually, it will match nothing, because there is an error: w- is the start of a character range, and that range is out of order. It should be %w\- instead.
^[%w\-.]+$
Means:
^ assert position at start of the string
[%w\-.]+ match a single character present in the list below
+ Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
%w a single character in the list %w literally (case sensitive)
\- matches the character - literally
. the literal character .
$ assert position at end of the string
Edit
As the OP changed the question and the tags, this answer no longer fits as a proper answer. It is a POSIX-based answer.
As @zx81 commented:
%w plays the role of \w in Lua: it matches any alphanumeric character (unlike \w, it does not include "_")

Full name regex in Ruby

I know there are lots of similar questions, but I couldn't find my case anywhere.
I'm trying to write a Full Name RegEx in Ruby on Rails user model.
It should validate that first name and last name are filled with one whitespace. Both of the names should contain at least 2 characters (ex: Li Ma).
As a bonus, but not strictly necessary, I would like to collapse the whitespace to one character in case the user mistypes and enters more than one space (ex: "Li   Ma" will be trimmed to "Li Ma").
Currently I'm validating it like that (Warning: It might be incorrect):
validates :name,
  presence: true,
  length: {
    maximum: 64,
    minimum: 5,
    message: 'must be a minimum: 5 letters and a maximum: 64 letters'
  },
  format: {
    # Full Name RegEx
    with: /[\w\-\']+([\s]+[\w\-\']){1}/
  }
This works for me, but doesn't check for a minimum of 2 characters in each name (ex: "Peter P" is currently accepted). It also accepts multiple whitespace characters, which is not good (ex: "Peter   P").
I know that this problem of identifying names is very culture-centric and it might be not a proper way to validate full name (maybe there are people with one character name), but this is currently a requirement.
I don't want to split this field to 2 different fields First name and Last name as it will complicate user interface.
You could match the following regex:
/([\w\-\']{2,})([\s]+)([\w\-\']{2,})/
and replace with the following (assuming your tool supports capturing groups):
'\1 \3' or $1 $3, whatever the syntax is.
It gets rid of the extra whitespace and keeps only one space, as you wanted.
Demo: http://regex101.com/r/oQ6aO7
result = subject.gsub(/\A(?=[\w' -]{5,64})([\w'-]{2,})([\s]{1})\s*?([\w'-]{2,})\Z/, '\1\2\3')
http://regex101.com/r/dT1fJ4
Assert position at the beginning of the string «\A»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=[\w' -]{5,64})»
Match a single character present in the list below «[\w' -]{5,64}»
Between 5 and 64 times, as many times as possible, giving back as needed (greedy) «{5,64}»
A word character (letters, digits, and underscores) «\w»
The character “'” «'»
The character “ ” « »
The character “-” «-»
Match the regular expression below and capture its match into backreference number 1 «([\w'-]{2,})»
Match a single character present in the list below «[\w'-]{2,}»
Between 2 and unlimited times, as many times as possible, giving back as needed (greedy) «{2,}»
A word character (letters, digits, and underscores) «\w»
The character “'” «'»
The character “-” «-»
Match the regular expression below and capture its match into backreference number 2 «([\s]{1})»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «[\s]{1}»
Exactly 1 times «{1}»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the regular expression below and capture its match into backreference number 3 «([\w'-]{2,})»
Match a single character present in the list below «[\w'-]{2,}»
Between 2 and unlimited times, as many times as possible, giving back as needed (greedy) «{2,}»
A word character (letters, digits, and underscores) «\w»
The character “'” «'»
The character “-” «-»
Assert position at the end of the string (or before the line break at the end of the string, if any) «\Z»
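Putting the two requirements together (collapse extra whitespace, then require two names of at least 2 characters each), one possible arrangement is to normalize first and validate afterwards. This is a sketch, not the answers' exact approach: FULL_NAME, normalize_name and valid_full_name? are hypothetical names, and only the character class [\w'-] is taken from the answers above.

```ruby
# Hypothetical names throughout; the character class [\w'-] comes from the
# answers above. Note that \w in Ruby covers ASCII letters, digits and "_" only.
FULL_NAME = /\A[\w'-]{2,} [\w'-]{2,}\z/

def normalize_name(raw)
  # Trim both ends and collapse any run of whitespace into a single space.
  raw.strip.gsub(/\s+/, " ")
end

def valid_full_name?(raw)
  FULL_NAME.match?(normalize_name(raw))
end

puts normalize_name("Li   Ma")       # the extra spaces are collapsed
puts valid_full_name?("Li   Ma")     # two names, 2 characters each
puts valid_full_name?("Peter P")     # second name is too short
```

In a Rails model the normalization step would typically run in a callback before validation, so that the stored value already contains a single space.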

Write a Lex rule to parse Integer and Float

I am writing a parser for a script language.
I need to recognize strings, integers and floats.
I successfully recognize strings with the rule:
[a-zA-Z0-9_]+ {return STRING;}
But I have problem recognizing Integers and Floats. These are the (wrong) rules I wrote:
["+"|"-"][1-9]{DIGIT}* { return INTEGER;}
["+"|"-"]["0." | [1-9]{DIGIT}*"."]{DIGIT}+ {return FLOAT;}
How can I fix them?
Furthermore, since a "abc123" is a valid string, how can I make sure that it is recognized as a string and not as the concatenation of a string ("abc") and an Integer ("123") ?
First problem: There's a difference between (...) and [...]. Your regular expressions don't do what you think they do because you're using the wrong punctuation.
Beyond that:
No numeric rule recognizes 0.
Both numeric rules require an explicit sign.
Your STRING rule recognizes integers.
So, to start:
[...] encloses a set of individual characters or character ranges. It matches a single character which is a member of the set.
(...) encloses a regular expression. The parentheses are used for grouping, as in mathematics.
"..." encloses a sequence of individual characters, and matches exactly those characters.
With that in mind, let's look at
["+"|"-"][1-9]{DIGIT}*
The first bracket expression ["+"|"-"] is a set of individual characters or ranges. In this case, the set contains: ", +, " (again, which has no effect because a set contains zero or one instances of each member), |, and the range "-", which is a range whose endpoints are the same character, and consequently only includes that character, ", which is already in the set. In short, that was equivalent to ["+|]. It will match one of those three characters. It requires one of those three characters, in fact.
The second bracket expression [1-9] matches one character in the range 1-9, so it probably does what you expected. Again, it matches exactly one character.
Finally, {DIGIT} matches the expansion of the name DIGIT. I'll assume that you have the definition:
DIGIT [0-9]
somewhere in your definitions section. (In passing, I note that you could have just used the character class [:digit:], which would have been unambiguous, and you would not have needed to define it.) It's followed by a *, which means that it will match zero or more repetitions of the {DIGIT} definition.
Now, an example of a string which matches that pattern:
|42
And some examples of strings which don't match that pattern:
-7 # The pattern must start with |, + or "
42 # Again, the pattern must start with |, + or "
+0 # The character following the + must be in the range [1-9]
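The same bracket-versus-group difference can be observed in Ruby, whose regex syntax treats [...] and (...) in a comparable way. This is an illustration only, not a (f)lex rule: the constant names are invented, the bracket version uses the simplified set ["+|] derived above, and (f)lex quoting rules do not carry over to Ruby.

```ruby
# Hypothetical constant names; Ruby regexes, not (f)lex rules.
# BRACKET_SIGN uses the simplified set from the analysis above: one of " + |
BRACKET_SIGN = /\A["+|][1-9][0-9]*\z/
# GROUP_SIGN uses a group with alternation: either + or -
GROUP_SIGN = /\A(?:\+|-)[1-9][0-9]*\z/

["|42", "-7", "42", "+0", "+13"].each do |s|
  puts "#{s.inspect}: bracket=#{BRACKET_SIGN.match?(s)} group=#{GROUP_SIGN.match?(s)}"
end
```

The bracket version accepts |42 and rejects -7, while the group version does the opposite. Both still reject 42 and +0, for the reasons listed above.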
Similarly, your float pattern, once the [...] expressions are simplified, becomes (writing out the individual pieces one per line, to make it more obvious):
["+|] # i.e. the set " + |
["0.|[1-9] # i.e. the set " 0 . | [ 1 2 3 4 5 6 7 8 9
{DIGIT}* # Any number of digits
"." # A single period
] # A single ]
{DIGIT}+ # one or more digits
So here's a possible match:
"..]3
I'll skip over writing out the solution because I think you'll benefit more from doing it yourself.
Now, the other issues:
Some rule should match 0. If you don't want to allow leading zeros, you'll need to just add it as a separate rule.
Use the optional operator (?) to indicate that the preceding object is optional. eg. "foo"? matches either the three characters f, o, o (in order) or matches the empty string. You can use that to make the sign optional.
The problem is not the matching of abc123, as in your question. (F)lex always gives you the longest possible match, and the only rule which could match the starting character a is the string rule, so it will allow the string rule to continue as long as it can. It will always match all of abc123. However, it will also match 123, which you would probably prefer to be matched by your numeric rule. Here, the other (f)lex matching criterion comes into play: when there are two or more rules which could match exactly the same string, and none of the rules can match a longer string, (f)lex chooses the first rule in the file. So if you want to give numbers priority over strings, you have to put the number rule earlier in your (f)lex file than the string rule.
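The two matching criteria described here (longest match first, then earliest rule on a tie) can be mimicked in a few lines of Ruby. This is a toy model of the selection logic, not how (f)lex is implemented; the RULES table, the simplified unsigned INTEGER pattern and the next_token helper are all hypothetical.

```ruby
# Hypothetical rule table: order matters, earlier rules win ties.
# The INTEGER pattern is a simplified stand-in, not the final numeric rule.
RULES = [
  [:INTEGER, /\A[0-9]+/],
  [:STRING,  /\A[a-zA-Z0-9_]+/]
]

def next_token(input)
  best = nil
  RULES.each do |name, pattern|
    text = input[pattern] # matched prefix, or nil if the rule does not apply
    next if text.nil?
    # Strict ">" keeps the earlier rule when two matches have the same length.
    best = [name, text] if best.nil? || text.length > best[1].length
  end
  best
end

p next_token("abc123") # only the STRING rule can start with "a"
p next_token("123")    # both rules match 3 characters, the first one wins
p next_token("123abc") # STRING matches more characters, so it wins
```

Swapping the two entries in RULES would make "123" come back as a STRING, which is the ordering problem described above.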
I hope that gives you some ideas about how to fix things.

Resources