Regex pattern for checking name with limited characters - ios

I have a requirement to validate username with some special characters in it('-) and white space in it. I am able to achieve this with the help of the following regex -
^[a-zA-z]+([ '-][a-zA-Z]+)*$
But I am unable to add a limit to the maximum number of characters that user can enter say 25 to this particular regex. So can any one please explain how to do the same for the above regex? Thanks.

You may add a negative lookahead at the beginning disallowing 26 or more chars:
^(?!.{26})[a-zA-Z]+([ '-][a-zA-Z]+)*$
^^^^ ^
You also have a typo [A-z], it must be [A-Z]. See Difference between regex [A-z] and [a-zA-Z].
The negative lookahead (the (?!...) construct above) is anchored at the start of the string (meaning it is placed right after ^), and the length check is performed only once, right before parsing with the main, consuming pattern part.
You can also see more on how a negative lookahead works here.

Just add a lookahead at the beginning to match only for a max of 25 characters by adding a (?=.{1,25}$).
Your new regex will be: ^(?=.{1,25}$)[a-zA-z]+([ '-][a-zA-Z]+)*$

Related

Regular wrong regular expression, not validating

please i want to validate the inputs from a user, the format for the inputs would be: 3 uppercase characters, 3 integer numbers, an optional space, a -, an optional space, either a 'LAB or ((EN or ENLH) with 1 interger number ranging from a [1-9]).
The regex i wrote is
/\D{3}\d{3}\s?-\s?(LAB|(EN(LH)?\d{1}))/
am finding it difficult to stop inputs after the LAB so that when EEE333 - LAB1 is inputed it becomes invalid.
If you are asking how to prevent LAB1 at the end, use an end of line anchor $ in your regex test:
/\D{3}\d{3}\s?-\s?(LAB|(EN(LH)?\d{1}))$/
If you are trying to require exactly one digit at the end of the acceptable strings, move the single digit match outside of the optional groups:
/\D{3}\d{3}\s?-\s?(LAB|(EN(LH)?))\d{1}$/
I have wrote for you the following regular expression:
[A-Z]{3}[0-9]{3}\s?-\s?(?:LAB|(?:EN|LH))[1-9]{1}
The regex works a follows:
[A-Z]{3}
MATCH EXACTLY THREE UPPERCASE CHARACTERS RANGING FROM A TO Z
[0-9]{3}
MATCH EXACTLY THREE NUMBERS RANGING FROM 0 TO 9
\s?\-\s?
MATCH a space (optional) or a '-' (required) or a space (optional)
(?:LAB|(?:EN|LH))
MATCH 'LAB' OR ('EN' OR 'LH')?: omits capturing LAB OR EN OR LH
[1-9]{1}
MATCH EXACTLY ONE NUMBERS RANGING FROM 1 TO 9
You could place your regex between word boundaries \b.
You start your regex with \D which is any character that is not a digit. That would for example also match $%^. You could use [A-Z].
You use \d{1} which is a shorhand for [0-9], but you want to match a digit between 1 and 9 [1-9]. You could also omit the {1}.
Maybe this updated will work for you?
\b[A-Z]{3}\d{3} ?- ?(?:LAB|(?:EN(?:LH)?[1-9]))\b
Explanation
A word boundary \b
Match 3 uppercase characters [A-Z]{3}
Match 3 digits \d{3}
Match an optional whitespace, a hyphen and another optional whitespace ?- ?
A non capturing group which for example matches LAB or EN EN1 or ENLH or ENLH9 (?:EN(?:LH)?[1-9]))
A word boundary \b

Ruby: Split a string into substring of maximum 40 characters

I have some strings with a sentence and i need to subdivise it into a substring of maximum 40 characters.
But i don't want to split the sentence in the middle of a word.
I tried with .gsub function but it's return 40 characters maximum and avoid to cut the string in the middle of a word. But it's return only the first occurence.
sentence[0..40].gsub(/\s\w+$/,'')
I tried with split but i can select only the fist 40 characters and split in the middle of a word...
sentence.split(...){40}
My string is "Sure, we will show ourselves only when we know the east door has been opened.".
The string output i want is
["Sure, we will show ourselves only when we","know the east door has
been opened."]
Do you have a solution ? Thanks
Your first attempt:
sentence[0..40].gsub(/\s\w+$/,'')
almost works, but it has one fatal flaw. You are splitting on the number of characters before cutting off the last word. This means you have no way of knowing whether the bit being trimmed off was a whole word, or a partial word.
Because of this, your code will always cut off the last word.
I would solve the problem as follows:
sentence[/\A.{0,39}[a-z]\b/mi]
\A is an anchor to fix the regex to the start of the string.
.{0,39}[a-z] matches on 1 to 40 characters, where the last character must be a letter. This is to prevent the last selected character from being punctuation or space. (Is that desired behaviour? Your question didn't really specify. Feel free to tweak/remove that [a-z] part, e.g. [a-z.] to match a full stop, if desired.)
\b is a word boundary look-around. It is a zero-width matcher, on beginning/end of words.
/mi modifiers will include case insensitive (i.e. A-Z) and multi-line matches.
One very minor note is that because this regex is matching 1 to 40 characters (rather than zero), it is possible to get a null result. (Although this is seemingly very unlikely, since you'd need a 1-word, 41+ letter string!!) To account for this edge case, call .to_s on the result if needed.
Update: Thank you for the improved edit to your question, providing a concrete example of an input/result. This makes it much clearer what you are asking for, as the original post was somewhat ambiguous.
You could solve this with something like the following:
sentence.scan(/.{0,39}[a-z.!?,;](?:\b|$)/mi)
String#scan returns an array of strings that match the pattern - so you can then re-join these strings to reconstruct the original.
Again, I have added a few more characters (!?,;) to the list of "final characters in the substring". Feel free to tweak this as desired.
(?:\b|$) means "either a word boundary, or the end of the line". This fixes the issue of the result not including the final . in the substrings. Note that I have used a non-capture group (?:) to prevent the result of scan from changing.

Checking whether a string contains a phone number

Trying to work out how to parse out phone numbers that are left in a string.
e.g.
"Hi Han, this is Chewie, Could you give me a call on 02031234567"
"Hi Han, this is Chewie, Could you give me a call on +442031234567"
"Hi Han, this is Chewie, Could you give me a call on +44 (0) 203 123 4567"
"Hi Han, this is Chewie, Could you give me a call on 0207-123-4567"
"Hi Han, this is Chewie, Could you give me a call on 02031234567 OR +44207-1234567"
And be able to consistently replace any one of them with some other item (e.g. some text, or a link).
Am assuming it's a regex type approach (I'm already doing something similar with email which works well).
I've got to
text.scan(/([^A-Z|^"]{6,})/i)
Which leaves me a leading space I can't work out how to drop (would appreciate the help there).
Is there a standard way of doing this that people use?
It also drops things into arrays, which isn't particularly helpful
i.e. if there were multiple numbers.
[["02031234567"]["+44207-1234567"]]
as opposed to
["02031234567","+44207-1234567"]
Adding in the third use-case with spaces is difficult. I think the only way to successfully meet that acceptance criteria would be to chain a #gsub call on to your #scan.
Thus:
text.gsub(/\s+/, "").scan(/([^A-Z|^"|^\s]{6,})/i)
The following code will extract all the numbers for you:
text.scan(/(?<=[ ])[\d \-+()]+$|(?<=[ ])[\d \-+()]+(?=[ ]\w)/)
For the examples you supplied this results in:
["02031234567"]
["+442031234567"]
["+44 (0) 203 123 4567"]
["0207-123-4567"]
["02031234567", "+44207-1234567"]
To understand this regex, what we are matching is:
[\d \-+()]+ which is a sequence of one or more digits, spaces, minus, plus, opening or closing brackets (in any order - NB regex is greedy by default, so it will match as many of these characters next to each other as possible)
that must be preceded by a space (?<=[ ]) - NB the space in the positive look-behind is not captured, and therefore this makes sure that there are no leading spaces in the results
and is either at the end of the string $, or | is followed by a space then a word character (?=[ ]\w) (NB this lookahead is not captured)
This pattern will get rid of the space but not match your third case with spaces:
/([^A-Z|^"|^\s]{6,})/i
This is what I came to in the end in case it helps somebody
numbers = text.scan(/([^A-Z|^"]{6,})/i).collect{|x| x[0].strip }
That gives me an array of
["+442031234567", "02031234567"]
I'm sure there is a more elegant way of doing this and possibly you'd want to check the numbers for likelihood of being phonelike - e.g. using the brilliant Phony gem.
numbers = text.scan(/([^A-Z|^"]{6,})/i).collect{|x| x[0].strip }
real_numbers = numbers.keep_if{|n| Phony.plausible? PhonyRails.normalize_number(n, default_country_code: "GB")}
Which should help exclude serial numbers or the like from being identified as numbers. You'll obviously want to change the country code to something relevant for you.

What does these two regex match?

I can't figure out what does this regex match:
A: "\\/\\/c\\/(\\d*)"
B: "\\/\\/(\\d*)"
I suppose they are matching some kind of number sequence since \d matches any digit but I'd like to know an example of a string that would be a match for this regex.
The pattern syntax is that specified by ICU. Expressions are created with NSRegularExpression in an iOS app and are correct.
The first matches //c/ + 0 or more digits. The second matches // + 0 or more digits. In both the digits are captured.
An example of a match for A) is //c/123
An example of a match for B) is //12345
When I use Cygwin which emulates Bash on Windows, I sometimes run into situations where I have to escape my escape characters which is what I think is making this expression look so weird. For instance, when I use sed to look for a single '\' I sometimes have to write it as '\\\\'. (Funny, StackOverflow proved my point. If you write 4 backslashes in the comment, it only shows two. So if you process it again, they might all disappear depending on your situation).
Considering this, it might be helpful to think of pairs of backslashes as representing only one if you're coming from a similar situation. My guess would be you are. Because of this I would say Erik Duymelinck is probably spot on. This will capture a sequence of digits that may or may not follow a couple slashes and a c:
//c/000
//00000
This regex matches an odd sequence of characters, which, at first glance, almost seem like a regex, since \d is a digit, and followed by an asterisk (\d*) would mean zero-or-more digits. But it's not a digit, because the escape-slash is escaped.
\\/\\/c\\/(\\d*)
So, for instance, this one matches the following text:
\/\/c\/\
\/\/c\/\d
\/\/c\/\dd
\/\/c\/\ddd
\/\/c\/\dddd
\/\/c\/\ddddd
\/\/c\/\dddddd
...
This one is almost the same
\\/\\/(\\d*)
except you just delete the c\/ from the above results:
\/\/\
\/\/\d
\/\/\dd
\/\/\ddd
\/\/\dddd
\/\/\ddddd
\/\/\dddddd
...
In both cases, the final \ and optional d is [capture group][1] one.
My first impression was that these regexes were intended for escaping in Java strings, meaning they would be completely invalid. If the were escaped for Java strings, such as
Pattern p = Pattern.compile("\\/\\/c\\/(\\d*)");
It would be invalid, because after un-escaping, it would result in this invalid regex:
\/\/c\/(\d*)
The single escape-slashes (\) are invalid. But the \d is valid, as it would mean any digit.
But again, I don't think they're invalid, and they're not escaped for a Java string. They're just odd.

Write a Lex rule to parse Integer and Float

I am writing a parse for a script language.
I need to recognize strings, integers and floats.
I successfully recognize strings with the rule:
[a-zA-Z0-9_]+ {return STRING;}
But I have problem recognizing Integers and Floats. These are the (wrong) rules I wrote:
["+"|"-"][1-9]{DIGIT}* { return INTEGER;}
["+"|"-"]["0." | [1-9]{DIGIT}*"."]{DIGIT}+ {return FLOAT;}
How can I fix them?
Furthermore, since a "abc123" is a valid string, how can I make sure that it is recognized as a string and not as the concatenation of a string ("abc") and an Integer ("123") ?
First problem: There's a difference between (...) and [...]. Your regular expressions don't do what you think they do because you're using the wrong punctuation.
Beyond that:
No numeric rule recognizes 0.
Both numeric rules require an explicit sign.
Your STRING rule recognizes integers.
So, to start:
[...] encloses a set of individual characters or character ranges. It matches a single character which is a member of the set.
(...) encloses a regular expression. The parentheses are used for grouping, as in mathematics.
"..." encloses a sequence of individual characters, and matches exactly those characters.
With that in mind, let's look at
["+"|"-"][1-9]{DIGIT}*
The first bracket expression ["+"|"-"] is a set of individual characters or ranges. In this case, the set contains: ", +, " (again, which has no effect because a set contains zero or one instances of each member), |, and the range "-", which is a range whose endpoints are the same character, and consequently only includes that character, ", which is already in the set. In short, that was equivalent to ["+|]. It will match one of those three characters. It requires one of those three characters, in fact.
The second bracket expression [1-9] matches one character in the range 1-9, so it probably does what you expected. Again, it matches exactly one character.
Finally, {DIGIT} matches the expansion of the name DIGIT. I'll assume that you have the definition:
DIGIT [0-9]
somewhere in your definitions section. (In passing, I note that you could have just used the character class [:digit:], which would have been unambiguous, and you would not have needed to define it.) It's followed by a *, which means that it will match zero or more repetitions of the {DIGIT} definition.
Now, an example of a string which matches that pattern:
|42
And some examples of strings which don't match that pattern:
-7 # The pattern must start with |, + or "
42 # Again, the pattern must start with |, + or "
+0 # The character following the + must be in the range [0-9]
Similarly, your float pattern, once the [...] expressions are simplified, becomes (writing out the individual pieces one per line, to make it more obvious):
["+|] # i.e. the set " + |
["0.|[1-9] # i.e. the set " 0 | [ 1 2 3 4 5 6 7 8 9
{DIGIT}* # Any number of digits
"." # A single period
] # A single ]
{DIGIT}+ # one or more digits
So here's a possible match:
"..]3
I'll skip over writing out the solution because I think you'll benefit more from doing it yourself.
Now, the other issues:
Some rule should match 0. If you don't want to allow leading zeros, you'll need to just a it as a separate rule.
Use the optional operator (?) to indicate that the preceding object is optional. eg. "foo"? matches either the three characters f, o, o (in order) or matches the empty string. You can use that to make the sign optional.
The problem is not the matching of abc123, as in your question. (F)lex always gives you the longest possible match, and the only rule which could match the starting character a is the string rule, so it will allow the string rule to continue as long as it can. It will always match all of abc123. However, it will also match 123, which you would probably prefer to be matched by your numeric rule. Here, the other (f)lex matching criterion comes into play: when there are two or more rules which could match exactly the same string, and none of the rules can match a longer string, (f)lex chooses the first rule in the file. So if you want to give numbers priority over strings, you have to put the number rule earlier in your (f)lex file than the string rule.
I hope that gives you some ideas about how to fix things.

Resources