Regex for string first chars - ruby-on-rails

In my Rails app I need to validate a string so that, on creation, its first character is neither whitespace nor a special character.
For example, " file" and "%file" aren't valid. Do you know what regex I should use?
Thanks!

The following regex will only match if the first character of the string is a letter, number, or '_':
^\w
To restrict to just letters or numbers:
^[0-9a-zA-Z]
The ^ has a special meaning in regular expressions: outside of a character class ([...]) it anchors the match to a start position without consuming any characters. In most engines that position is the start of the string; in Ruby it is the start of any line, so use \A when you need the start of the whole string.
If you want to match all invalid strings instead, you can place a ^ inside the character class to negate it, so the previous expressions would become:
^[^\w]
or
^[^0-9a-zA-Z]
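As a rough illustration in Ruby, the check might look like the sketch below. The constant VALID_START and the method valid_start? are names invented for this example, and \A is used in place of ^ so that only the very first character of the string is tested.

```ruby
# Hypothetical helper: true when the first character is an ASCII letter or digit.
# \A anchors to the start of the string; Ruby's ^ would also match after a newline.
VALID_START = /\A[0-9a-zA-Z]/

def valid_start?(name)
  name.match?(VALID_START)
end

puts valid_start?("file")    # true
puts valid_start?(" file")   # false
puts valid_start?("%file")   # false

# With ^ instead of \A, a later line can satisfy the check:
puts "%bad\ngood".match?(/^[0-9a-zA-Z]/)   # true
puts valid_start?("%bad\ngood")            # false
```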

A good place to try out Ruby regexes interactively is Rubular. The link shows the answer that @Dave G gave along with a few test examples (and at first glance it seems to work). You could expand the examples to convince yourself further.

The regex
^[^[:punct:][:space:]]+
should do what you want. I'm not 100% sure what Ruby provides in terms of regular expression and POSIX class support, so your mileage may vary.
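Ruby's regexp engine does accept POSIX bracket expressions such as [[:punct:]] and [[:space:]], so a hedged sketch of this approach could look like the following. NO_SPECIAL_START is a name made up for the example, \A replaces ^ to anchor at the start of the string, and exactly which symbols count as [[:punct:]] can differ between Ruby versions.

```ruby
# Sketch: accept a string only if its first character is neither
# punctuation nor whitespace. The constant name is hypothetical.
NO_SPECIAL_START = /\A[^[:punct:][:space:]]/

[" file", "%file", "file", "9lives"].each do |s|
  verdict = s.match?(NO_SPECIAL_START) ? "valid" : "invalid"
  puts "#{s.inspect}: #{verdict}"
end
```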

Related

Regex inverse match

I'm developing my lexer with flex and I need to create a rule that matches '' (two single quotes) and a rule that matches anything but two single quotes. The first part is easy, just a \'\' does the job, but I'm not sure how to write the other rule. I guess it needs to be some kind of inverse regex, but I'm not familiar with flex regex.
Thanks
What do you mean exactly by "anything other than two single quotes"? Any string of any length which does not contain ''? Any two characters other than ''? The shortest string up to the next occurrence of ''?
The third option is the only one which makes sense to me in the context of lexical analysis; its corresponding regular expression is:
([']?[^'])+
(That is, any sequence of characters in which a ', if it occurs, is followed by something other than another '.)
For the second task, split the string on the delimiter ''. That gives you all the substrings that do not contain the delimiter and are separated by it.
Try this in flex:
(([^'])|(\'[^']))+
An explanation:
[^'] matches any character but a single quote.
\'[^'] matches a single quote followed by any other character.
EDIT: added in extra parens to ensure correct precedence.
You can use this rule:
([^']+|\'[^']+)+|([^']+|\'[^']+)+\'$
Because you define another rule for '', and because flex takes the longest match at a given position, this rule can't match two (or more) consecutive quotes, and it allows a single quote at the end of the string.
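The tokenizing behaviour discussed in these answers can be approximated outside flex. The Ruby sketch below is only an analogy: it puts the '' rule first in a single alternation to stand in for flex's separate rule, uses String#scan, and the constant name TOKEN and the sample inputs are invented. It does not handle a lone quote at the very end of the input.

```ruby
# Two "rules" in one alternation: '' first, then runs of text in which
# any single quote is followed by a non-quote character.
TOKEN = /''|(?:'?[^'])+/

p "abc''def".scan(TOKEN)      # ["abc", "''", "def"]
p "it's ''ok''".scan(TOKEN)   # ["it's ", "''", "ok", "''"]
```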

Regular expression for hashtag text with multi language support

I have texts like #sample_123, #123_sample, and #_sample123, so I need a regular expression that checks that the text contains only alphanumeric characters and underscores, and I also want to support multiple languages.
Currently I am using a regular expression like (#)([:alpha:]+), but it detects only #sample (e.g. in #sample_123). Can anyone suggest the correct regular expression to fix this problem?
You can use:
^#(\d|\w|_)+$
This validates any string that starts with a hash and otherwise contains only alphanumeric characters or underscores (\w already covers digits and the underscore, so the \d and _ alternatives are redundant). There is no restriction on how many characters follow the hash, so, for example, #_ is considered valid; if this is not the wanted behavior, please be more detailed about the constraints you want.
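For the multi-language requirement, one option is to use Unicode property classes instead of the \w shorthand. The sketch below is in Ruby, where \w covers only ASCII letters, digits, and the underscore by default; the question does not say which engine is in use, so treat that as an assumption. \p{M} is included so that combining marks are accepted, which is a guess at what multi-language support should cover. HASHTAG and the sample tags are invented.

```ruby
# \p{L} = letters, \p{M} = combining marks, \p{N} = numbers, in any script.
HASHTAG = /\A#[\p{L}\p{M}\p{N}_]+\z/

samples = ['#sample_123', '#123_sample', '#_sample123',
           '#пример_1', '#sample!', 'sample']

samples.each do |tag|
  puts "#{tag}: #{tag.match?(HASHTAG) ? 'valid' : 'invalid'}"
end
```

The first four samples should be reported as valid and the last two as invalid.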

Matching Unicode letters with RegExp

I am in need of matching Unicode letters, similarly to PCRE's \p{L}.
Now, since Dart's RegExp class is based on ECMAScript's, it doesn't have the concept of \p{L}, sadly.
I'm looking into perhaps constructing a big character class that matches all Unicode letters, but I'm not sure where to start.
So, I want to match letters like:
foobar
מכון ראות
But the R symbol shouldn't be matched:
BlackBerry®
Neither should any ASCII control characters or punctuation marks, etc. Essentially every letter in every language Unicode supports, whether it's å, ä, φ or ת, they should match if they are actual letters.
I know this is an old question, but RegExp now supports Unicode categories (since Dart 2.4), so you can do something like this:
RegExp alpha = RegExp(r'\p{Letter}', unicode: true);
print(alpha.hasMatch("f")); // true
print(alpha.hasMatch("ת")); // true
print(alpha.hasMatch("®")); // false
I don't think that complete information about classification of Unicode characters as letters or non-letters is anywhere in the Dart libraries. You might be able to put something together that would mostly work using things in the Intl library, particularly Bidi. I'm thinking that, for example,
isLetter(oneCharacterString) => Bidi.endsWithLtr(oneCharacterString) || Bidi.endsWithRtl(oneCharacterString);
might do a plausible job. At least it seems to have a number of ranges for valid characters in there. Or you could put together your own RegExp based on the information in _LTR_CHARS and _RTL_CHARS. It explicitly says it's not 100% accurate, but good for most practical purposes.
Looks like you're going to have to iterate through the runes in the string and then check the integer value against a table of Unicode ranges.
Golang has some code to generate these tables directly from the unicode source. See maketables.go, and some of the other files in the golang unicode package.
Or take the lazy option, and file a Dart bug, and wait for the Dart team to implement it ;)
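The table-lookup idea can be sketched in a few lines. The example below is in Ruby rather than Dart and uses a deliberately tiny, hand-written table covering only ASCII, Latin-1, basic Greek, and Hebrew letters; a real table would be generated from the Unicode data files. All names here (LETTER_RANGES, letter_codepoint?, all_letters?) are hypothetical.

```ruby
# Illustrative only: NOT a complete list of Unicode letters.
LETTER_RANGES = [
  0x41..0x5A, 0x61..0x7A,               # ASCII A-Z, a-z
  0xC0..0xD6, 0xD8..0xF6, 0xF8..0xFF,   # Latin-1 letters (skips the multiplication and division signs)
  0x391..0x3A1, 0x3A3..0x3A9,           # Greek capital letters
  0x3B1..0x3C9,                         # Greek small letters
  0x5D0..0x5EA                          # Hebrew letters
].freeze

# True when the code point falls inside one of the ranges above.
def letter_codepoint?(codepoint)
  LETTER_RANGES.any? { |range| range.cover?(codepoint) }
end

# True when the string is non-empty and every code point is in the table.
def all_letters?(text)
  !text.empty? && text.each_codepoint.all? { |cp| letter_codepoint?(cp) }
end

puts all_letters?("foobar")                     # true
puts all_letters?("\u00E5\u00E4\u03C6\u05EA")   # true  (a-ring, a-diaeresis, phi, tav)
puts all_letters?("\u00AE")                     # false (registered sign)
```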
There's no support for this yet in Dart or JS.
The XRegExp JS library has support for generating fairly large character class regexps to support something like this. You may be able to generate the regexp, print it, and cut and paste it into your app.

Regex to validate user names with at least one letter and no special characters

I'm trying to write a user name validation that has the following restrictions:
Must contain at least 1 letter (a-zA-Z)
May not contain anything other than digits, letters, or underscores
The following examples are valid: abc123, my_name, 12345a
The following examples are invalid: 123456, my_name!, _1235
I found something about using positive lookaheads for the letter constraint: (?=.*[a-zA-Z]), and it looks like there could be some sort of negative lookahead for the second constraint, but I'm not sure how to mix them together into one regex. (Note... I am not really clear on what the .* portion does inside the lookahead...)
Is it something like this: /(?=.*[a-zA-Z])(?!.*[^a-zA-Z0-9_])/
Edit:
Because the question asks for a regex, the answer I'm accepting is:
/^[a-zA-Z0-9_]*[a-zA-Z][a-zA-Z0-9_]*$/
However, the thing I'm actually going to implement is the suggestion by Bryan Oakley to split it into multiple smaller checks. This makes it easier to both read and extend in the future in case requirements change. Thanks all!
And because I tagged this with ruby-on-rails, I'll include the code I'm actually using:
validate :username_format

def username_format
  # \A and \z anchor to the whole string; in Ruby, ^ and $ only anchor to a line
  has_one_letter = username =~ /[a-zA-Z]/
  all_valid_characters = username =~ /\A[a-zA-Z0-9_]+\z/
  errors.add(:username, "must have at least one letter and contain only letters, digits, or underscores") unless has_one_letter && all_valid_characters
end
/^[a-zA-Z0-9_]*[a-zA-Z][a-zA-Z0-9_]*$/: 0 or more valid characters followed by one alphabetical followed by 0 or more valid characters, constrained to be both the beginning and the end of the line.
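A quick way to convince yourself is to run the pattern against the examples from the question. The sketch below is in Ruby and substitutes \A and \z for ^ and $, since Ruby's ^ and $ match at line boundaries; the constant name USERNAME is made up.

```ruby
# Any valid characters, then one required letter, then any valid characters.
USERNAME = /\A[a-zA-Z0-9_]*[a-zA-Z][a-zA-Z0-9_]*\z/

%w[abc123 my_name 12345a 123456 my_name! _1235].each do |name|
  puts "#{name}: #{name.match?(USERNAME) ? 'valid' : 'invalid'}"
end
```

The first three names should print as valid and the last three as invalid, matching the lists in the question.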
It's easy to check whether the pattern has any illegal characters, and it's easy to check whether there's at least one letter. Trying to do that all in one regular expression will make your code hard to understand.
My recommendation is to do two tests. Put the tests in functions to make your code absolutely dead-simple to understand:
if no_illegal_characters(string) && contains_one_alpha(string) {
...
}
For the former you can use the pattern ^[a-zA-Z0-9_]+$, and for the latter you can use [a-zA-Z].
If you don't like the extra functions that's ok, just don't try to solve the problem with one difficult-to-read regular expression. There are no bonus points awarded for cramming as much functionality into one expression as possible.
A simpler regex, which forbids a leading underscore but does not by itself require a letter (it still accepts 123456), is:
/^[a-zA-Z0-9][a-zA-Z0-9_]*$/
I encourage you to try it out live on http://rubular.com/

What is the proper Lua pattern for quoted text?

I've been playing with this for an hour or two and have found myself at a roadblock with the Lua pattern matching utilities. I am attempting to match all quoted text in a string and replace it if needed.
The pattern I have come up with so far is: (\?[\"\'])(.-)%1
This works in some cases but, not all cases:
Working: "This \"is a\" string of \"text to\" test with"
Not Working: "T\\\"his \"is\' a\" string\" of\' text\" to \"test\" wit\\\"h"
In the not working example I would like it to match to (I made a function that gets the matches I desire, I'm just looking for a pattern to use with gsub and curious if a lua pattern can do this):
string
a" string" of
is' a" string" of' text
test
his "is' a" string" of' text" to "test" wit
I'm going to continue to use my function for the time being, but I'm curious whether there is a pattern I could or should be using and I'm just missing something about patterns.
(a few edits because I forgot about Stack Overflow's formatting)
(another edit to make a non-HTML example, since it was leading to assumptions that I was attempting to parse HTML)
Trying to match escaped, quoted text using regular expressions is like trying to remove the daisies (and only the daisies) from a field using a lawnmower.
"I made a function that gets the matches I desire"
This is the correct move.
"I'm curious if a lua pattern can do this"
From a practical point of view, even if a pattern can do this, you don't want to. From a theoretical point of view, you are trying to find a double quote that is preceded by an even number of backslashes. This is definitely a regular language, and the regular expression you want would be something like the following (Lua quoting conventions)
[[[^\](\\)*"(.-[^\](\\)*)"]]
And the quoted string would be result #2. But Lua patterns are not full regular expressions; in particular, you cannot put a * after a parenthesized pattern.
So my guess is that this problem cannot be solved using Lua patterns, but since Lua patterns are not a standard thing in automata theory, I'm not aware of any body of proof technique that you could use to prove it.
The issue with escaped quotes is that, in general, if there's an odd number of backslashes before the quote, then it's escaped, and if there's an even number, it's not. I do not believe that Lua pattern-matching is powerful enough to represent this condition, so if you need to parse text like this, then you should seek another way. Perhaps you can iterate through the string and parse it, or you could find each quote in turn and read backwards, counting the backslashes until you find a non-backslash character (or the beginning of the string).
If you absolutely must use patterns for some reason, you could try doing this in a multi-step process. First, gsub for all occurrences of two backslashes in a row, and replace them with some sentinel value. This must be a value that does not already occur in the string. You could try something like "\001" if you know this string doesn't contain non-printable characters. Anyway, once you've replaced all sequences of two backslashes in a row, any backslashes left are escaping the following character. Now you can apply your original pattern, and then finally you can replace all instances of your sentinel value with two backslashes again.
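The backslash-counting idea from this answer is language-independent. As a minimal sketch in Ruby rather than Lua, the hypothetical helper below finds every quote character and walks backwards to count the backslashes in front of it, treating an even count (including zero) as unescaped.

```ruby
# Returns the indexes of quote characters that are NOT escaped, i.e. that
# are preceded by an even number (including zero) of backslashes.
def unescaped_quote_indexes(text, quote = '"')
  indexes = []
  text.each_char.with_index do |char, i|
    next unless char == quote

    backslashes = 0
    j = i - 1
    while j >= 0 && text[j] == "\\"
      backslashes += 1
      j -= 1
    end
    indexes << i if backslashes.even?
  end
  indexes
end

# In single-quoted Ruby strings, \" is a backslash followed by a quote.
p unescaped_quote_indexes('say \"hi\" "yes"')   # [11, 15]
```

Pairing up consecutive indexes from the result then gives the start and end of each quoted span.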
Lua's pattern language is adequate for many simple cases. And it has at least one trick you don't find in a typical regular expression package: a way to match balanced parentheses. But it has its limits as well.
When those limits are exceeded, I reach for LPeg. LPeg is an implementation of a Parsing Expression Grammar for Lua, and was implemented by one of Lua's original authors, so the adaptation to Lua is done quite well. A PEG lets you specify anything from simple patterns to complete language grammars. LPeg compiles the grammar to a bytecode and executes it extremely efficiently.
You should NOT be trying to parse HTML with regular expressions. HTML and XML are NOT regular languages and cannot be reliably manipulated with regular expressions. You should use a dedicated HTML parser. Here are lots of explanations why.
