Difference between \b and \s in Regular Expression - ios

I was learning regular expressions in iOS and saw this tutorial: http://www.raywenderlich.com/30288/nsregularexpression-tutorial-and-cheat-sheet
It reads like this for \b:
\b matches word boundary characters such as spaces and punctuation. to\b will match the "to" in "to the moon" and "to!", but it will not match "tomorrow". \b is handy for "whole word" type matching.
and \s:
\s matches whitespace characters such as spaces, tabs, and newlines. hello\s will match "hello " in "Well, hello there!".
I have two questions on this:
1) What is the difference between \s and \b? When should I use which?
2) "\b is handy for "whole word" type matching": I don't understand what this means.
I need some guidance on these two.

\b Boundary characters
\b matches the boundary itself, not the boundary character (such as a comma or period). It has no length of its own, but it can be used to find, for example, an e at the end of a word.
For example in the sentence: "Hello there, this is one test. Testing"
The regex e\b will match an e if it's at the end of a word (followed by a word boundary). Notice that the e in "test" and "Testing" does not match, since that e is not followed by a boundary.
\s Whitespace
\s on the other hand matches the actual white space characters (like spaces and tabs). In the same sentence it will match all the spaces between the words.
Edit
Since \b doesn't make much sense alone, I showed how to use it as e\b (above). The OP asked (in a comment) what e\s would match compared to e\b, to better explain the difference between \b and \s.
In the same string there is only one match for e\s, while there were two matches for e\b, since the comma is not whitespace. Note that the e\s match includes the whitespace character, whereas the e\b match doesn't.
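The comparison above can be reproduced with a short sketch. Python's re module is used here purely for illustration (the question is about NSRegularExpression, whose \b and \s should behave the same way on this input); the variable names are made up for the example.

```python
import re

text = "Hello there, this is one test. Testing"

# e\b: an "e" followed by a word boundary. The boundary is zero-width,
# so each match contains only the "e".
boundary_matches = [(m.start(), m.group()) for m in re.finditer(r"e\b", text)]

# e\s: an "e" followed by a whitespace character, which becomes part of the match.
space_matches = [(m.start(), m.group()) for m in re.finditer(r"e\s", text)]

print(boundary_matches)  # the "e" ending "there" and the "e" ending "one"
print(space_matches)     # only the "e " after "one"
```

The e ending "there" is found by e\b but not by e\s, because the character after it is a comma rather than whitespace.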

\b matches a word boundary. It is a zero-width assertion, meaning it does not match a character; it matches a position where a certain condition is true.
\b is related to \w. \w defines "word characters": letters, digits, and underscores. So \b matches at a change from a word character to a non-word character, or the other way round. That means it matches the start and end of a word, but not the character before or after the word.
\s is a predefined character class that matches any whitespace character.
See and try out what \bFoo\b matches here on Regexr
See and try out what \sFoo\s matches here on Regexr
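As an offline alternative to trying the two patterns interactively, here is a minimal sketch in Python (chosen only for illustration). The sample strings are invented for this example.

```python
import re

samples = ["Foo", "a Foo b", "(Foo)", "Foobar", "xFoo"]

# \bFoo\b needs a word boundary on both sides; start/end of string and
# punctuation both qualify.
word_bounded = [s for s in samples if re.search(r"\bFoo\b", s)]

# \sFoo\s needs an actual whitespace character on both sides.
space_bounded = [s for s in samples if re.search(r"\sFoo\s", s)]

print(word_bounded)
print(space_bounded)
```

Only "a Foo b" satisfies \sFoo\s, while \bFoo\b also accepts "Foo" on its own and "(Foo)".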

\b is zero-width. That is, it doesn't actually match any character. Meanwhile, \s does match a character. This is an important distinction for capturing and more complicated regular expressions.
For example, say you're trying to match numbers that begin with multiple zeros, like 007 or 000101101. You might try:
0+\d*
But see, that would also match 1007 and 101000101101! So then, you might try:
\s0+\d*
But see how that wouldn't match a 007 at the beginning of the string (because there's no space character)? Using \b allows you to get the "whole word (or number)":
\b0+\d*
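The three attempts can be compared side by side. This is a small sketch in Python, with an input string made up from the numbers mentioned above.

```python
import re

text = "007 1007 000101101 101000101101"

# No anchor: also picks up zeros in the middle of other numbers.
naive = re.findall(r"0+\d*", text)

# Leading \s: misses the number at the start of the string and
# includes the whitespace character in each match.
with_space = re.findall(r"\s0+\d*", text)

# Leading \b: zero-width, so it works at the start of the string too.
with_boundary = re.findall(r"\b0+\d*", text)

print(naive)
print(with_space)
print(with_boundary)
```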

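\b matches the position between a word character (letter, digit or underscore) and a non-word character, without including any character in the match.
\s matches only white space.
For example:
\b would match next to any of these when they adjoin a word character: "!?,.#$%^&*()+ " (but not _, which is itself a word character).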
$text = "Hello, Yo! moo .";
$regex = "~o\b~";
^---Will match three o's: the ones ending "Hello", "Yo" and "moo".
$text = "Hello, Yo! moo .";
$regex = "~o\s~";
^---Will only match the final 'o' in 'moo' (together with the space after it).
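The same two searches, sketched in Python rather than PHP for illustration, reporting where each pattern matches. The extra "foo_bar" check is an addition to the original example; it relies on the underscore counting as a word character.

```python
import re

text = "Hello, Yo! moo ."

# Start offsets of every "o" that sits right before a word boundary.
boundary_hits = [m.start() for m in re.finditer(r"o\b", text)]

# Start offsets of every "o" that is followed by a whitespace character.
space_hits = [m.start() for m in re.finditer(r"o\s", text)]

# "_" is a word character, so there is no boundary between "o" and "_".
underscore_hit = re.search(r"o\b", "foo_bar")

print(boundary_hits, space_hits, underscore_hit)
```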

Related

Regular wrong regular expression, not validating

I want to validate input from a user. The format is: 3 uppercase characters, 3 digits, an optional space, a -, an optional space, then either LAB or (EN or ENLH followed by 1 digit in the range 1-9).
The regex I wrote is
/\D{3}\d{3}\s?-\s?(LAB|(EN(LH)?\d{1}))/
I am finding it difficult to reject input that continues after LAB, so that EEE333 - LAB1 is treated as invalid.
If you are asking how to prevent LAB1 at the end, use an end of line anchor $ in your regex test:
/\D{3}\d{3}\s?-\s?(LAB|(EN(LH)?\d{1}))$/
If you are trying to require exactly one digit at the end of the acceptable strings, move the single digit match outside of the optional groups:
/\D{3}\d{3}\s?-\s?(LAB|(EN(LH)?))\d{1}$/
I have written the following regular expression for you:
[A-Z]{3}[0-9]{3}\s?-\s?(?:LAB|(?:EN|LH))[1-9]{1}
The regex works as follows:
[A-Z]{3}
MATCH EXACTLY THREE UPPERCASE CHARACTERS RANGING FROM A TO Z
[0-9]{3}
MATCH EXACTLY THREE DIGITS RANGING FROM 0 TO 9
\s?\-\s?
MATCH AN OPTIONAL WHITESPACE CHARACTER, THEN A REQUIRED '-', THEN ANOTHER OPTIONAL WHITESPACE CHARACTER
(?:LAB|(?:EN|LH))
MATCH 'LAB' OR ('EN' OR 'LH'); THE ?: MAKES THE GROUPS NON-CAPTURING
[1-9]{1}
MATCH EXACTLY ONE DIGIT RANGING FROM 1 TO 9
You could place your regex between word boundaries \b.
You start your regex with \D which is any character that is not a digit. That would for example also match $%^. You could use [A-Z].
You use \d{1}; \d is shorthand for [0-9], but you want to match a digit between 1 and 9, which is [1-9]. You could also omit the {1}.
Maybe this updated pattern will work for you?
\b[A-Z]{3}\d{3} ?- ?(?:LAB|(?:EN(?:LH)?[1-9]))\b
Explanation
A word boundary \b
Match 3 uppercase characters [A-Z]{3}
Match 3 digits \d{3}
Match an optional space, a hyphen and another optional space ?- ?
A non-capturing group which matches LAB, or EN or ENLH followed by a digit from 1 to 9, for example EN1 or ENLH9 (?:LAB|(?:EN(?:LH)?[1-9]))
A word boundary \b
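To check the pattern above against a few inputs, here is a minimal sketch in Python (the question does not say which language is used, so this is only an illustration). The helper name is_valid and the sample inputs other than EEE333 - LAB1 are assumptions made for the example.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(r"\b[A-Z]{3}\d{3} ?- ?(?:LAB|(?:EN(?:LH)?[1-9]))\b")

def is_valid(code: str) -> bool:
    """Hypothetical helper: True if the pattern is found in the input."""
    return PATTERN.search(code) is not None

accepted = [c for c in ["EEE333 - LAB", "EEE333-EN1", "EEE333 - ENLH9"] if is_valid(c)]
rejected = [c for c in ["EEE333 - LAB1", "EEE333 - EN0", "EEE333 - EN12", "eee333 - LAB"]
            if not is_valid(c)]

print(accepted)
print(rejected)
```

Because \b only checks the characters directly next to the match, this pattern would still be found inside a longer line of text. If the whole input must conform, PATTERN.fullmatch(code) or explicit ^ and $ anchors may be a better fit.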

Flex how to differentiate between capital words, lower case words and words?

I have the following rules:
capital_word [A-Z]+
lower_case_word [a-z]+
word [^ \t\n\.]
delim [ \t\n\.]
For the word "Hello", it says "H" is a capital word and "ello" a lower case word. How can I get "Hello" recognized as "Word"?
If you're testing a single word, you want to match the whole word, and you want to allow lowercase letters after the first capital.
capital_word ^[A-Z][a-zA-Z]+$
lower_case_word ^[a-z]+$
word ^[^ \t\n\.]+$
delim [ \t\n\.]
^ marks the beginning of the tested text and $ marks its end, meaning you want to match the whole text. The anchors are needed for the first three definitions but not the last (since for the last you just want to know whether a delimiter is present, I think).

Ultraedit regex to remove all words which contains number

I am trying to make an UltraEdit regex which allows me to remove all words containing a digit from a text file.
For example:
test
test2
t2est
te2st
and...
get only
test
A case-insensitive search with the Perl regular expression search string \<[a-z]+\d\w*\> finds entire words containing at least 1 digit.
\< ... beginning of a word. \b for any word boundary could also be used.
[a-z]+ ... any letter 1 or more times. You can put additional characters into the square brackets, like ÄÖÜäöüß, if they are used in the language of the text file.
\d ... any digit, i.e. 0-9.
\w* ... any word character 0 or more times. "Word character" means all word characters according to the Unicode table, which includes language-dependent word characters, all digits and the underscore.
\> ... end of a word. \b for any word boundary could also be used.
A case-insensitive search with the UltraEdit regular expression search string [a-z]+[0-9][a-z0-9_]++ also finds entire words containing at least 1 digit, provided the find option Match whole word is checked as well.
[a-z]+ ... any letter 1 or more times. You can put additional characters used in the language of the text file into the square brackets.
[0-9] ... any digit.
[a-z0-9_]++ ... any letter, digit or underscore 0 or more times.
The UltraEdit regexp search string [a-z]+[0-9][a-z0-9_]++ would be [a-z]+[0-9][a-z0-9_]* in Unix/Perl syntax, which could also be used with the find option Match whole word checked instead of the Perl regexp search above.
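Outside UltraEdit, the same idea can be sketched in Python. Python's re module has no \< and \> anchors, so this example uses \b on both sides, which the answer above mentions as an alternative. The sample text is the one from the question.

```python
import re

text = "test\ntest2\nt2est\nte2st"

# One or more letters, a digit, then any further word characters,
# all between word boundaries.
pattern = re.compile(r"\b[a-z]+\d\w*\b", re.IGNORECASE)

matches = pattern.findall(text)

# Remove the matched words and drop the lines left empty.
remaining = [line for line in pattern.sub("", text).splitlines() if line]

print(matches)
print(remaining)
```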

What does this pattern ^[%w-.]+$ mean in Lua?

Just came across this pattern, which I really don't understand:
^[%w-.]+$
And could you give me some examples to match this expression?
This is valid in Lua, where %w is (almost) the equivalent of \w in other languages.
^[%w-.]+$ means: match a string that is entirely composed of alphanumeric characters (letters and digits), dashes or dots.
Explanation
The ^ anchor asserts that we are at the beginning of the string
The character class [%w-.] matches one character that is a letter or digit (the meaning of %w), or a dash, or a period. This is roughly the equivalent of [\w-.] in JavaScript, except that \w also matches the underscore
The + quantifier matches such a character one or more times
The $ anchor asserts that we are at the end of the string
Reference
Lua Patterns
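For readers without a Lua interpreter at hand, the behaviour described above can be approximated in Python. This is only a sketch under an assumption: %w is taken to mean ASCII letters and digits (as in the "C" locale), so the character class is spelled out by hand. The sample strings and the helper name matches are invented for the example.

```python
import re

# ASCII-only approximation of the Lua pattern ^[%w-.]+$ (assumption: "C" locale).
# fullmatch plays the role of the ^ and $ anchors.
LUA_LIKE = re.compile(r"[A-Za-z0-9.-]+")

def matches(s: str) -> bool:
    """Hypothetical helper: True if the whole string fits the pattern."""
    return LUA_LIKE.fullmatch(s) is not None

samples = ["foo-bar.txt", "v1.2.3", "abc123", "a b", "a_b", ""]
results = {s: matches(s) for s in samples}

print(results)
```

Strings made only of letters, digits, dashes and dots pass; a space, an underscore, or an empty string does not.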
Actually, it will match nothing, because there is an error: w- is the start of a character range, and the range is out of order. So it should be %w\- instead.
^[%w\-.]+$
Means:
^ assert position at start of the string
[%w\-.]+ match a single character present in the list below
+ Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
%w a single character in the list %w literally (case sensitive)
\- matches the character - literally
. the literal character .
$ assert position at end of the string
Edit
As the OP changed the question and the tags, this answer no longer fits as a proper answer. It is a POSIX-based answer.
As #zx81 commented:
%w is Lua's counterpart of \w. It matches any alphanumeric character (unlike \w, it does not include "_").

Why can't regular expressions match for # sign?

For the string Be there # six.
Why does this work:
str.gsub! /\bsix\b/i, "seven"
But trying to replace the # sign doesn't match:
str.gsub! /\b#\b/i, "at"
Escaping it doesn't seem to work either:
str.gsub! /\b\#\b/i, "at"
This is down to how \b is interpreted. \b is a "word boundary": a zero-length match that occurs at a position with a word character on one side and a non-word character (or the start or end of the string) on the other. The word characters are limited to [A-Za-z0-9_] and maybe a few other things, but # is not a word character, so \b won't match just before it (and after a space). The space itself is not the boundary.
More about word boundaries...
If you need to replace the # with surrounding whitespace, you can capture it after the \b and use backreferences. This captures preceding whitespace with \s* for zero or more space characters.
str.gsub! /\b(\s*)#(\s*)\b/i, "\\1at\\2"
=> "Be there at six."
Or to insist upon whitespace, use \s+ instead of \s*.
str = "Be there # six."
str.gsub! /\b(\s+)#(\s+)\b/i, "\\1at\\2"
=> "Be there at six."
# No match without whitespace...
str = "Be there#six."
str.gsub! /\b(\s+)#(\s+)\b/i, "\\1at\\2"
=> nil
At this point, we're starting to introduce redundancies by forcing the use of \b. It could just as easily be done with /(\w+\s+)#(\s+\w+)/, forgoing the \b match in favor of \w word characters followed by \s whitespace.
Update after comments:
If you want to treat # like a "word" which may appear at the beginning or end, or inside bounded by whitespace, you may use \W to match "non-word" characters, combined with ^$ anchors with an "or" pipe |:
# Replace # at the start, middle, before punctuation
str = "# Be there # six #."
str.gsub! /(^|\W+)#(\W+|$)/, '\\1at\\2'
=> "at Be there at six at."
(^|\W+) matches either ^ the start of the string, or a sequence of non-word characters (like whitespace or punctuation). (\W+|$) is similar but can match the end of the string $.
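The behaviour described in this answer is not specific to Ruby. As a rough cross-check, here is the same set of substitutions sketched with Python's re.sub (the variable names are made up for the example):

```python
import re

# \b#\b needs a word character directly on both sides of "#".
spaced = re.sub(r"\b#\b", "at", "Be there # six.")   # spaces around "#": no match
tight = re.sub(r"\b#\b", "at", "Be there#six.")      # word characters around "#"

# Capturing the whitespace between the boundaries and the "#".
captured = re.sub(r"\b(\s+)#(\s+)\b", r"\1at\2", "Be there # six.")

# Treating "#" as a "word" delimited by non-word characters or string ends.
anchored = re.sub(r"(^|\W+)#(\W+|$)", r"\1at\2", "# Be there # six #.")

print(spaced)
print(tight)
print(captured)
print(anchored)
```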
\b matches a word boundary, which is where a word character is next to a non-word character. In your string the # has a space on each side, and neither # nor space is a word character, so there is no match.
Compare:
'be there # six'.gsub /\b#\b/, 'at'
produces
'be there # six'
(i.e. no changes)
but
'be there#six'.gsub /\b#\b/, 'at' # no spaces around #
produces
"be thereatsix"
Also
'be there # six'.gsub /#/, 'at' # no word boundaries in regex
produces
"be there at six"
