Why is "\b" not catching my word boundary in Rails? - ruby-on-rails

I’m using Rails 4.2.7. I want to match an expression and take advantage of the word boundary expression because there may be a space, arbitrary number of spaces, or a dash between my words. But this isn’t working
2.3.0 :006 > /first\bsecond/.match('first second')
=> nil
The manual here — https://ruby-doc.org/core-2.1.1/Regexp.html suggests that “\b” is the right expression to use to catch word boundaries so I’m wondering where I’m going wrong.

\b matches a zero-length word boundary, not a space. You're looking for something more like this:
/first\b.\bsecond/.match('first second')
This will match any character (.) in between first and second, as long as there is a word boundary on either side.
However, this is not how word boundaries are usually used (since there is no need to use a zero-length check when you are matching the word boundary itself). \b is essentially looking for a non-word character after a word character; so, instead, you could just look for a non-word character in-between the t in first and s in second:
/first\Wsecond/.match('first second')
This is exactly the same as the first example...but, realistically, you probably just want to match whitespace and can use something like this:
/first\ssecond/.match('first second')
#WiktorStribiżew's third example shows the best use of word boundaries (at the beginning and end). This is because you aren't matching anything before or after, so a zero-length test is helpful. Otherwise, the above examples could match something like first secondary. In the end, I'd use an expression like:
/\bfirst\ssecond\b/.match('first second')

Related

Regex inverse match

I'm developing my lexer with flex and I need to create a rule that matches '' (two single quotes) and a rule that matches anything but two single quotes. The first part is easy, just a \'\' does the job, but I'm not sure how to write the other rule. I guess it needs to be some kind of inverse regex, but I'm not familiar with flex regex.
Thanks
What do you mean exactly by "anything other than two single quotes"? Any string of any length which does not contain ''? Any two characters other than ''? The shortest string up to the next occurrence of ''?
The third option is the only one which makes sense to me in the context of lexical analysis; its corresponding regular expression is:
([']?[^'])+
(That is, any sequence of characters in which a ', if it occurs, is followed by something other than another '.)
For the second task, split the string with the delimeter ''. So you have all substring which does not contain the delimeter and separated by it.
Try this in flex:
(([^'])|(\'[^']))+
An explanation:
[^'] matches any character but a single quote.
\'[^'] matches a single quote followed by any other character.
EDIT: added in extra parens to ensure correct precedence.
you can use this rule:
([^']+|\'[^']+)+|([^']+|\'[^']+)+\'$
since you define an other rule: '' and since flex will take the longest match for a position. This rule can't match two (or more) consecutive quotes, and allows a single quote at the end of the string.

CFStringTokenizer not tokenizing lower-case sentences

I'm trying to use CFStringTokenizer with kCFStringTokenizerUnitSentence to split a string into sentences. The first problem I'm having is that sentences need to be capitalized in order for them to be recognized as sentences. If not, it just thinks it's part of the previous sentence.
I'm splitting user-entered text so I'm expecting the text to be very unclean.
Is there something else I can do with CFStringTokenizer to have it detect uncapitalized sentences? Or will I have to use another method of splitting altogether?
I followed the answer on this SO question for my implementation:
How to get an array of sentences using CFStringTokenizer?
NOTE: After testing a bit more it seems that with kCFStringTokenizerUnitSentence, if a '!' or a '?' is followed by an uncapitalized sentence, it will recognize the sentence. Also, if one of those punctuation marks is followed by a sentence without a space between the '!' and the first word, it will still separate.
So the one case I need to work around is a '.' followed by an uncapitalized sentence.
ANOTHER OPTION I found, if you're getting the text from a textField, is to use this:
textField.autocapitalizationType = UITextAutocapitalizationTypeSentences;
It will automatically capitalize sentences so you don't have to worry about converting for CFStringTokenizer. It still doesn't account for edge cases like abbreviations, but at least in my case the user will have an option to delete the auto-capitalization if it's wrong.
You can convert the input string to all uppercase first and then run it through CFStringTokenizer and use the ranges to get the substrings of the original input string. But you must be careful here because some characters might become more than 1 character after conversion to uppercase.

What to do when unescapable character(s) are escaped?

In designing of a (mini)language:
When there are certain characters that should be escaped to lose special meanings (like quotes in some programming languages), what should be done, especially from a security perspective, when characters that are not escapable (e.g. normal characters which never have special meaning) are escaped? Should an error be "error"ed, or should the character be discarded, or should it be in the output the same as if it was not escaped?
Example:
In a simple language where strings are delimited by double-quotes("), and any quotes in a given string are escaped with a back-slash(\): for input "We \said, \"We want Moshiach Now\"" -- what would should be done with the letter s in said which is escaped?
I prefer the lexer to whine when this occurs. A lexer/parser should be tight about syntax; one can always loosen it up later. If you are sloppy, you'll find you can't retract a decision you didn't think you made.
Assume that you initially decide to treat " backslash not-an-escape " as that pair of characters, and the "T" is
not-an-escape today. Sometime later you decide to extend the language, and want "\T" to mean something special, and you change your language.
You'll find an angry mob of programmers storming your design castle,
because for them, "\T" means "\" "T" (or "T" depending on your default decision),
and you just broke their code. You hang your head in shame, retract the decision,
and then realize... oops, there are no more available escape characters!
This lesson goes for any piece of syntax that isn't well defined in your language. If it isn't explicitly legal, it should be implicitly illegal and your compiler should check it. Or you'll never be able to extend your successful language.
If your language isn't going to be successful, you may not care as much.
Well, one way to solve the problem is for the backslash to just mean backslash when it precedes a non-escapable character. That's what Python does:
>>> print "a\tb"
a b
>>> print "a\tb\Rc"
a b\Rc
Obviously, most systems take the escape character to mean "take the next character verbatim", so escaping a "non-escapable" character is usually harmless. The problem later happens when you get to comparisons and such, where the literal text does not represent the actual value (that's where you see a lot of issues securitywise, especially with things like URLs).
So on the one hand, you can only accept a limited number of escaped characters. In that sense, you have an "escape sequence", rather than an escaped character (the \x is the entire sequence rather than a \ followed by an x). That's like the most safe mechanism, and it's not really burdensome to write.
The other option is to ensure that you you "canonicalizing" everything you compare, through some ruleset. This typically means removing all of the escape sequences properly up front, before comparison and comparing only the final values rather than the literals.
Most systems interpret the slash as Will Hartung says, except for alphanumerics which are variously used as aliases for control codes, character classes, word boundaries, the start of hex sequences, case region markers, hex or octal digits, etc. \s in particular often means white-space in perl5 style regexs. JavaScript, which interprets it as 's' in one context and as whitespace in another suffers from subtle bugs because of this choice. Consider /foo\sbar/ vs new RegExp('foo\sbar').

Regex for string first chars

in my Rails app I need to validate a string that on creation can not have its first chars empty or composed by any special chars.
For example: " file" and "%file" aren't valid. Do you know what Regex I should use?
Thanks!
The following regex will only match if the first letter of the string is a letter, number, or '_':
^\w
To restrict to just letters or numbers:
^[0-9a-zA-Z]
The ^ has a special meaning in regular expressions, when it is outside of a character class ([...]) it matches the start of the string (without actually matching any characters).
If you want to match all invalid strings you can place a ^ inside of the character class to negate it, so the previous expressions would be:
^[^\w]
or
^[^0-9a-zA-Z]
A good place to interactively try out Ruby regexes is Rubular. The link I gave shows the answer that #Dave G gave along with a few test examples (and at first glance it seems to work). You could expand the examples to convince yourself further.
The regex
^[^[:punct:][:space:]]+
Should do what you want. I'm not 100% sure of what Ruby provides as far as regular expressions and POSIX class support so your mileage on this may vary.

What is the proper Lua pattern for quoted text?

I've been playing with this for an hour or tow and have found myself at a road block with the Lua pattern matching utilities. I am attempting to match all quoted text in a string and replace it if needed.
The pattern I have come up with so far is: (\?[\"\'])(.-)%1
This works in some cases but, not all cases:
Working: "This \"is a\" string of \"text to\" test with"
Not Working: "T\\\"his \"is\' a\" string\" of\' text\" to \"test\" wit\\\"h"
In the not working example I would like it to match to (I made a function that gets the matches I desire, I'm just looking for a pattern to use with gsub and curious if a lua pattern can do this):
string
a" string" of
is' a" string" of' text
test
his "is' a" string" of' text" to "test" wit
I'm going to continue to use my function instead for the time being, but am curious if there is a pattern I could/should be using and i'm just missing something with patterns.
(a few edits b/c I forgot about stackoverflows formating)
(another edit to make a non-html example since it was leading to assumptions that I was attempting to parse html)
Trying to match escaped, quoted text using regular expressions is like trying to remove the daisies (and only the daises) from a field using a lawnmower.
I made a function that gets the matches I desire
This is the correct move.
I'm curious if a lua pattern can do this
From a practical point of view, even if a pattern can do this, you don't want to. From a theoretical point of view, you are trying to find a double quote that is preceded by an even number of backslashes. This is definitely a regular language, and the regular expression you want would be something like the following (Lua quoting conventions)
[[[^\](\\)*"(.-[^\](\\)*)"]]
And the quoted string would be result #2. But Lua patterns are not full regular expressions; in particular, you cannot put a * after a parenthesized pattern.
So my guess is that this problem cannot be solved using Lua patterns, but since Lua patterns are not a standard thing in automata theory, I'm not aware of any body of proof technique that you could use to prove it.
The issue with escaped quotes is that, in general, if there's an odd number of backslashes before the quote, then it's escaped, and if there's an even number, it's not. I do not believe that Lua pattern-matching is powerful enough to represent this condition, so if you need to parse text like this, then you should seek another way. Perhaps you can iterate through the string and parse it, or you could find each quote in turn and read backwards, counting the backslashes until you find a non-backslash character (or the beginning of the string).
If you absolutely must use patterns for some reason, you could try doing this in a multi-step process. First, gsub for all occurrences of two backslashes in a row, and replace them with some sentinel value. This must be a value that does not already occur in the string. You could try something like "\001" if you know this string doesn't contain non-printable characters. Anyway, once you've replaced all sequences of two backslashes in a row, any backslashes left are escaping the following character. Now you can apply your original pattern, and then finally you can replace all instances of your sentinel value with two backslashes again.
Lua's pattern language is adequate for many simple cases. And it has at least one trick you don't find in a typical regular expression package: a way to match balanced parenthesis. But it has its limits as well.
When those limits are exceeded, then I reach for LPeg. LPeg is an implementation of a Parsing Expression Grammer for Lua, and was implemented by one of Lua's original authors so the adaptation to Lua is done quite well. A PEG allows specification of anything from simple patterns through complete language grammars to be written. LPeg compiles the grammar to a bytecode and executes it extremely efficiently.
you should NOT be trying to parse HTML with regular expressions, HTML and XML are NOT regular languages and can not be successfully manipulated with regular expressions. You should use a dedicated HTML parser. Here are lots of explanations why.

Resources