Password validation (regex?) - asp.net-mvc

I need to write some validation rules for a user password with the following requirements. C# ASP.NET MVC.
Passwords must be 6 - 8 characters
Must include at least one character each from at least three of the following categories:
Upper-case letters
Lower-case letters
Numeric digits
Non-alpha-numeric characters (e.g., !##$%...)
Must not contain any sequence of 3 or more characters in common with the username
Must not repeat any of the previous 1 passwords
Must be changed if the password is believed to be compromised in any way
Currently I've written a bunch of really messy validation rules using if statements and loops (especially the 3 characters in sequence with username part), which is currently functional but it just feels like it's wrong. Is there a better approach I can take?
Thank you

I wrote one very similar to what you are describing. They can be done as a regular expression, and when complete (at least for myself) it was a very rewarding accomplishment.
To accomplish this you are going to need to use a regex feature called lookaheads. See the information on the regular-expression.info site for all the gory details.
The second thing you will need is a real time regular expression tester to help you prototype your regex. I suggest you check out Rubular. Create several passwords that should work, and some that shouldn't work, and use those as your starting point.
Edit:
To elaborate on my above comment. Not every one of your requirements can or should be solved via a regex. Namely, the requirements you listed as:
Must not contain any sequence of 3 or more characters in common with the username
Must not repeat any of the previous 1 passwords
Must be changed if the password is believed to be compromised in any way
Should probably be handled separately from the main password validation regex, as these are highly contextual. The "sequence of 3 or more characters in common with the username" can probably be handled on the client side. However, the other two items are probably best left handled on the server side.
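A Ruby version of the split this answer recommends, added here as an example and not part of the original answer: the length rule and the "three of four categories" rule go into one lookahead-based regex, while the two contextual rules (overlap with the username, password history) are plain code. The PBKDF2 storage format for previous passwords, the constant-time comparison helper and the sample inputs are assumptions made for the example.

```ruby
require 'openssl'

# The four character categories; a password must draw from at least three.
CATEGORIES = {
  upper:  /[A-Z]/,
  lower:  /[a-z]/,
  digit:  /[0-9]/,
  symbol: /[^A-Za-z0-9]/
}.freeze

# Each 3-of-4 combination becomes a run of lookaheads; the four runs are
# alternated in front of the anchored 6-8 character length check.
PASSWORD_RE = begin
  alts = CATEGORIES.values.combination(3).map do |set|
    set.map { |re| "(?=.*#{re.source})" }.join
  end
  /\A(?:#{alts.join('|')}).{6,8}\z/
end

def password_shape_ok?(password)
  !!(password =~ PASSWORD_RE)
end

# Contextual rule kept out of the regex: any run of +n+ consecutive username
# characters that appears in the password (case-insensitive) is rejected.
def shares_run_with_username?(password, username, n = 3)
  u = username.downcase
  p = password.downcase
  return false if u.length < n
  (0..u.length - n).any? { |i| p.include?(u[i, n]) }
end

# History rule: previous passwords are assumed to be stored as salted PBKDF2
# hashes (never plaintext), so the candidate is hashed with each stored salt.
ITERATIONS = 100_000
def hash_password(password, salt)
  OpenSSL::PKCS5.pbkdf2_hmac(password, salt, ITERATIONS, 32, 'sha256')
end

# Constant-time byte comparison so a mismatch position is not observable.
def secure_equal?(a, b)
  return false unless a.bytesize == b.bytesize
  a.bytes.zip(b.bytes).inject(0) { |acc, (x, y)| acc | (x ^ y) }.zero?
end

def repeats_previous?(password, history)
  history.any? { |entry| secure_equal?(hash_password(password, entry[:salt]), entry[:hash]) }
end

# Returns the list of violated rules; empty means the password is acceptable.
def validate_password(password, username, history)
  errors = []
  errors << 'must be 6-8 characters and use at least three character categories' unless password_shape_ok?(password)
  errors << 'must not share three or more consecutive characters with the username' if shares_run_with_username?(password, username)
  errors << 'must not repeat the previous password' if repeats_previous?(password, history)
  errors
end
```

The 6-8 character cap is taken from the question's requirements; it is a hard limit worth pushing back on, since current guidance (NIST SP 800-63B) is a minimum of 8 characters with no short maximum, and the rest of the checks work unchanged with a longer range.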

Related

Profanity filter import

I am looking to write a basic profanity filter in a Rails based application. This will use a simple search and replace mechanism whenever the appropriate attribute gets submitted by a user. My question is, for those who have written these before, is there a CSV file or some database out there where a list of profanity words can be imported into my database? We are submitting the words that we will replace the profanities with on our own. We more or less need a database of profanities, racial slurs and anything that's not exactly rated PG-13 to get triggered.
As the Tin Man suggested, this problem is difficult, but it isn't impossible. I've built a commercial profanity filter named CleanSpeak that handles everything mentioned above (leet speak, phonetics, language rules, whitelisting, etc). CleanSpeak is capable of filtering 20,000 messages per second on a low end server, so it is possible to build something that works well and performs well. I will mention that CleanSpeak is the result of about 3 years of on-going development though.
There are a few things I tell everyone that is looking to try and tackle a language filter.
Don't use regular expressions unless you have a small list and don't mind a lot of things getting through. Regular expressions are relatively slow overall and hard to manage.
Determine if you want to handle conjugations, inflections and other language rules. These often add a considerable amount of time to the project.
Decide what type of performance you need and whether or not you can make multiple passes on the String. The more passes you make the slower your filter will be.
Understand the Scunthorpe and clbuttic problems and determine how you will handle these. This usually requires some form of language intelligence and whitelisting.
Realize that whitespace has a different meaning now. You can't use it as a word delimiter any more (b e c a u s e of this)
Be careful with your handling of punctuation because it can be used to get around the filter (l.i.k.e th---is)
Understand how people use ascii art and unicode to replace characters (/ = v - those are slashes). There are a lot of unicode characters that look like English characters and you will want to handle those appropriately.
Understand that people make up new profanity all the time by smashing words together (likethis) and figure out if you want to handle that.
You can search around StackOverflow for my comments on other threads as I might have more information on those threads that I've forgotten here.
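As an added example (not CleanSpeak's code or any answerer's), a minimal normalize-then-scan filter in Ruby that shows several of the points above in working form: leet substitutions, stripping of punctuation and whitespace so "l.i.k.e th---is" and "b e c a u s e" collapse back into words, substring scanning instead of a regex per word, and a whitelist of longer words whose span covers a hit (the Scunthorpe/clbuttic case). The word lists and the leet table are constructed for the example.

```ruby
# Constructed word lists; a real deployment loads these from a maintained source.
BANNED  = %w[hell damn].freeze
ALLOWED = %w[hello shell damnation].freeze

# Constructed leet table: each key is folded to the letter it imitates.
LEET = {
  '4' => 'a', '@' => 'a', '3' => 'e', '1' => 'i', '!' => 'i',
  '0' => 'o', '5' => 's', '$' => 's', '7' => 't', '|' => 'l'
}.freeze

# Lower-case, fold leet characters, then drop everything that is not a letter.
# Dropping whitespace too is deliberate: spaced-out letters are a common evasion,
# at the cost of matches that straddle word boundaries (see note below).
def normalize(text)
  text.downcase.tr(LEET.keys.join, LEET.values.join).gsub(/[^a-z]/, '')
end

# All inclusive index ranges where +needle+ occurs in +haystack+ (overlaps included).
def occurrences(haystack, needle)
  spans = []
  i = haystack.index(needle)
  while i
    spans << (i..(i + needle.length - 1))
    i = haystack.index(needle, i + 1)
  end
  spans
end

# Returns the banned words found in +text+ after normalization, ignoring any hit
# that sits entirely inside an occurrence of a whitelisted word.
def find_profanity(text, banned = BANNED, allowed = ALLOWED)
  norm = normalize(text)
  safe = allowed.flat_map { |w| occurrences(norm, w) }
  banned.select do |w|
    occurrences(norm, w).any? do |hit|
      safe.none? { |ok| ok.cover?(hit.first) && ok.cover?(hit.last) }
    end
  end
end
```

Because whitespace is discarded, "wash ell" normalizes to "washell" and trips the "hell" entry; that false positive is the price of catching "h e l l", and the whitelist is the only lever this simple design offers against it.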
Here's one you could use: Offensive/Profane Word List from CMU site
Based on personal experience, you do understand that it's an exercise in futility?
If someone wants to inject profanity, there's a slew of words that are innocent in one context, and profane in another so you'll have to write a context parser to avoid black-listing clean words. A quick glance at CMU's list shows words I'd never consider rude/crude/socially unacceptable. You'll see there are many words that could be proper names or nouns, countries, terms of endearment, etc. And, there are myriad ways to throw your algorithm off using L33T speak and such. Search Wikipedia and the internets and you can build tables of variations of letters.
Look at CMU's list and imagine how long the list would be if, in addition to the correct letter, every a could also be 4, o could be 0 or p, e could be 3, s could be 5. And, that's a very, very, short example.
I was asked to do a similar task and wrote code to generate L33T variations of the words, and generated a hit-list of words based on several profanity/offensive lists available on the internet. After running the generator, and being a little over 1/4 of the way through the file, I had over one million entries in my DB. I pulled the plug on the project at that point, because the time spent searching, even using Perl's Regex::Assemble, was going to be ridiculous, especially since it'd still be so easy to fool.
I recommend you have a long talk with whoever requested that, and ask if they understand the programming issues involved, and low-likelihood of accuracy and success, especially over the long-term, or the possible customer backlash when they realize you're censoring them.
I have one that I've added to (obfuscated a bit) but here it is: https://github.com/rdp/sensible-cinema/blob/master/lib/subtitle_profanity_finder.rb

regex for a full name

I've recently been receiving a lot of first name only entries in a form. While maybe I should have had 2 separate first and last name fields this always seemed to me a bit much. But I would like to try and get a full name which basically can only be determined by having at least one space.
I came up with this, but I'm wondering if someone has a better and possibly simpler solution?
/([a-zA-ZàáâäãåèéêëìíîïòóôöõøùúûüÿýñçčšžÀÁÂÄÃÅÈÉÊËÌÍÎÏÒÓÔÖÕØÙÚÛÜŸÝÑßÇŒÆČŠŽ∂ð,.'-]{2,}) ([a-zA-ZàáâäãåèéêëìíîïòóôöõøùúûüÿýñçčšžÀÁÂÄÃÅÈÉÊËÌÍÎÏÒÓÔÖÕØÙÚÛÜŸÝÑßÇŒÆČŠŽ∂ð,.'-]{2,})/
This is basically this /([a-zA-Z,.'-]) ([a-zA-Z,.'-])/ plus unicode support.
I'd first make sure that you really do need people to give you a last name. Is that a genuine requirement? If not, I'd skip it because it adds unnecessary complication and barriers to entry. If it really IS a requirement, it probably makes sense to have separate first and last name fields in your UI so that it's explicit.
The fact that you didn't do that to begin with suggests that you might not really need the last name as much as you think you do.
To answer your original question, this expression might give you what you're looking for without the guesswork:
/[\w]+([\s]+[\w]+){1}+/
It checks that the string contains at least 2 words separated by whitespace. Like Tim Pietzcker pointed out, validating the words themselves is prone to error.
In Ruby 1.9, you have access to Unicode properties (\p{L} is a Unicode letter). But trying to validate a name in any way (regex or not) is prone to failure because names are not what you think they are.
Your theory that "if there's a space, there must be a last name there" is incorrect, too - think of first and middle names...

Regex to validate user names with at least one letter and no special characters

I'm trying to write a user name validation that has the following restrictions:
Must contain at least 1 letter (a-zA-Z)
May not contain anything other than digits, letters, or underscores
The following examples are valid: abc123, my_name, 12345a
The following examples are invalid: 123456, my_name!, _1235
I found something about using positive lookaheads for the letter constraint: (?=.*[a-zA-Z]), and it looks like there could be some sort of negative lookahead for the second constraint, but I'm not sure how to mix them together into one regex. (Note... I am not really clear on what the .* portion does inside the lookahead...)
Is it something like this: /(?=.*[a-zA-Z])(?!.*[^a-zA-Z0-9_])/
Edit:
Because the question asks for a regex, the answer I'm accepting is:
/^[a-zA-Z0-9_]*[a-zA-Z][a-zA-Z0-9_]*$/
However, the thing I'm actually going to implement is the suggestion by Bryan Oakley to split it into multiple smaller checks. This makes it easier to both read and extend in the future in case requirements change. Thanks all!
And because I tagged this with ruby-on-rails, I'll include the code I'm actually using:
validate :username_format
def username_format
has_one_letter = username =~ /[a-zA-Z]/
all_valid_characters = username =~ /^[a-zA-Z0-9_]+$/
errors.add(:username, "must have at least one letter and contain only letters, digits, or underscores") unless (has_one_letter and all_valid_characters)
end
/^[a-zA-Z0-9_]*[a-zA-Z][a-zA-Z0-9_]*$/: 0 or more valid characters followed by one alphabetical followed by 0 or more valid characters, constrained to be both the beginning and the end of the line.
It's easy to check whether the pattern has any illegal characters, and it's easy to check whether there's at least one letter. Trying to do that all in one regular expression will make your code hard to understand.
My recommendation is to do two tests. Put the tests in functions to make your code absolutely dead-simple to understand:
if no_illegal_characters(string) && contains_one_alpha(string) {
...
}
For the former you can use the pattern ^[a-zA-Z0-9_]+$, and for the latter you can use [a-zA-Z].
If you don't like the extra functions that's ok, just don't try to solve the problem with one difficult-to-read regular expression. There are no bonus points awarded for cramming as much functionality into one expression as possible.
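An added Ruby implementation of the two-check approach above (not from the question or any answer), with one caveat the patterns in this thread share: in Ruby, ^ and $ match at line boundaries, so /^[a-zA-Z0-9_]+$/ accepts "abc\n<script>" because its first line is clean. The whole-string anchors \A and \z close that gap; both forms are kept here so the difference can be exercised.

```ruby
# Line-anchored form as posted in the thread: ^ and $ match around each line.
ALLOWED_CHARS_LINE = /^[a-zA-Z0-9_]+$/
# Whole-string form: \A and \z only match at the very start and very end.
ALLOWED_CHARS      = /\A[a-zA-Z0-9_]+\z/
HAS_LETTER         = /[a-zA-Z]/

# First check: nothing but letters, digits and underscores.
def no_illegal_characters?(name, pattern = ALLOWED_CHARS)
  !!(name =~ pattern)
end

# Second check: at least one letter somewhere in the name.
def contains_one_alpha?(name)
  !!(name =~ HAS_LETTER)
end

# The two checks combined; the pattern parameter exists only to demonstrate
# the line-anchor bypass and would not be exposed in real code.
def valid_username?(name, pattern = ALLOWED_CHARS)
  no_illegal_characters?(name, pattern) && contains_one_alpha?(name)
end
```

Rails 4 and later refuse a ^/$ pattern in validates_format_of (it raises an ArgumentError about multiline anchors unless multiline: true is passed) for exactly this reason; the hand-written validate method in the accepted answer gets no such warning, so the anchors have to be right by inspection.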
The simplest regex that resolves your problem is:
/^[a-zA-Z0-9][a-zA-Z0-9_]*$/
I encourage you to try it out live on http://rubular.com/

Standardizing "character set ranges" as internationally defined values

Let's say I have a field which accepts A-Z,a-z,0-9. If I'm trying to communicate to someone, via documentation or API creation, "what" my code can accept, I HAVE to say:
A-Z,a-z,0-9
Now, in my mind this is restrictive and error-prone.
Compare that to what i'm proposing.
Suppose A-Z,a-z,0-9 was allocated the "code" ANSI456
When I'm communicating that to someone, I can say that my code accepts ANSI456. If someone else was developing a check, there is no confusion on what my code can or cannot accept.
To those who will suggest just specifying character ranges, please note that what I'm envisioning will handle scenarios where even this is defined as a valid "code"
0-9, +, -, *, /
In fact, if it's done properly, we can have a site generate automatic code in various languages to accommodate the different "codes".
Okay - I KNOW there are ~ infinite values, eg:
a-z
is different from
a-l,n-z
And these would have two different codes in this "system".
I'm not proposing a HUMAN moderated system - it can be completely automatic BUT systematic way of generating these "codes"
There already is such a standard, although it doesn't have the word "standard" in its name. It is called Perl 5 compatible regular expressions, and it is used in Perl 5, Java, JavaScript, libpcre and many other contexts.

Regex: Match a string containing numbers and letters but not a string of just numbers

Question
I would like to be able to use a single regex (if possible) to require that a string fits [A-Za-z0-9_] but doesn't allow:
Strings containing just numbers or/and symbols.
Strings starting or ending with symbols
Multiple symbols next to each other
Valid
test_0123
t0e1s2t3
0123_test
te0_s1t23
t_t
Invalid
t__t
____
01230123
_0123
_test
_test123
test_
test123_
Reasons for the Rules
The purpose of this is to filter usernames for a website I'm working on. I've arrived at the rules for specific reasons.
Usernames with only numbers and/or symbols could cause problems with routing and database lookups. The route for /users/#{id} allows id to be either the user's id or user's name. So names and ids shouldn't be able to collide.
_test looks weird and I don't believe it's a valid subdomain, i.e. _test.example.com
I don't like the look of t__t as a subdomain. i.e. t__t.example.com
This matches exactly what you want:
/\A(?!_)(?:[a-z0-9]_?)*[a-z](?:_?[a-z0-9])*(?<!_)\z/i
At least one alphabetic character (the [a-z] in the middle).
Does not begin or end with an underscore (the (?!_) and (?<!_) at the beginning and end).
May have any number of numbers, letters, or underscores before and after the alphabetic character, but every underscore must be separated by at least one number or letter (the rest).
Edit: In fact, you probably don't even need the lookahead/lookbehinds due to how the rest of the regex works - the first ?: parenthetical won't allow an underscore until after an alphanumeric, and the second ?: parenthetical won't allow an underscore unless it's before an alphanumeric:
/\A(?:[a-z0-9]_?)*[a-z](?:_?[a-z0-9])*\z/i
Should work fine.
I'm sure that you could put all this into one regular expression, but it won't be simple and I'm not sure why insist on it being one regex. Why not use multiple passes during validation? If the validation checks are done when users create a new account, there really isn't any reason to try to cram it into one regex. (That is, you will only be dealing with one item at a time, not hundreds or thousands or more. A few passes over a normal sized username should take very little time, I would think.)
First reject if the name doesn't contain at least one number; then reject if the name doesn't contain at least one letter; then check that the start and end are correct; etc. Each of those passes could be a simple to read and easy to maintain regular expression.
What about:
/^(?=[^_])([A-Za-z0-9]+_?)*[A-Za-z](_?[A-Za-z0-9]+)*$/
It doesn't use a back reference.
Edit:
Succeeds for all your test cases. It is Ruby compatible.
This doesn't block "__", but it does get the rest:
([A-Za-z]|[0-9][0-9_]*)([A-Za-z0-9]|_[A-Za-z0-9])*
And here's the longer form that gets all your rules:
([A-Za-z]|([0-9]+(_[0-9]+)*([A-Za-z]|_[A-Za-z])))([A-Za-z0-9]|_[A-Za-z0-9])*
dang, that's ugly. I'll agree with Telemachus, that you probably shouldn't do this with one regex, even though it's technically possible. regex is often a pain for maintenance.
The question asks for a single regexp, and implies that it should be a regexp that matches, which is fine, and answered by others. For interest, though, I note that these rules are rather easier to state directly as a regexp that should not match. I.e.:
x !~ /[^A-Za-z0-9_]|^_|_$|__|^\d+$/
no other characters than letters, numbers and _
can't start with a _
can't end with a _
can't have two _s in a row
can't be all digits
You can't use it this way in a Rails validates_format_of, but you could put it in a validate method for the class, and I think you'd have much better chance of still being able to make sense of what you meant, a month or a year from now.
Here you go:
^(([a-zA-Z]([^a-zA-Z0-9]?[a-zA-Z0-9])*)|([0-9]([^a-zA-Z0-9]?[a-zA-Z0-9])*[a-zA-Z]+([^a-zA-Z0-9]?[a-zA-Z0-9])*))$
If you want to restrict the symbols you want to accept, simply change all [^a-zA-Z0-9] with [] containing all allowed symbols
(?=.*[a-zA-Z].*)^[A-Za-z0-9](_?[A-Za-z0-9]+)*$
This one works.
Look ahead to make sure there's at least one letter in the string, then start consuming input. Every time there is an underscore, there must be a number or a letter before the next underscore.
/^(?![\d_]+$)[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*$/
Your question is essentially the same as this one, with the added requirement that at least one of the characters has to be a letter. The negative lookahead - (?![\d_]+$) - takes care of that part, and is much easier (both to read and write) than incorporating it into the basic regex as some others have tried to do.
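To exercise this answer's pattern against the question's own valid and invalid lists, an added Ruby snippet (not part of the answer). The only change to the pattern is \A and \z in place of ^ and $, because Ruby's ^ and $ are line anchors; the user table and the lookup are constructed to illustrate the routing concern the question raises (a /users/:id parameter that may be an id or a name).

```ruby
# Negative lookahead rejects all-digit and all-underscore/digit names; the body
# allows single underscores only between alphanumeric runs.
USERNAME_RE = /\A(?![\d_]+\z)[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*\z/

def subdomain_safe_username?(name)
  !!(name =~ USERNAME_RE)
end

# The question's test vectors, verbatim.
VALID_NAMES   = %w[test_0123 t0e1s2t3 0123_test te0_s1t23 t_t].freeze
INVALID_NAMES = %w[t__t ____ 01230123 _0123 _test _test123 test_ test123_].freeze

# Returns the vectors the pattern gets wrong; both lists empty means it matches
# the question's expectations exactly.
def check_vectors
  {
    accepted_invalid: INVALID_NAMES.select { |n| subdomain_safe_username?(n) },
    rejected_valid:   VALID_NAMES.reject   { |n| subdomain_safe_username?(n) }
  }
end

# Constructed user table standing in for the database.
USERS = [{ id: 42, name: 'test_0123' }, { id: 7, name: 't_t' }].freeze

# Because a valid name can never be all digits, an all-digit parameter is
# unambiguously an id and anything else is unambiguously a name.
def find_user(param)
  if param =~ /\A\d+\z/
    USERS.find { |u| u[:id] == param.to_i }
  else
    USERS.find { |u| u[:name] == param }
  end
end
```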
[A-Za-z][A-Za-z0-9_]*[A-Za-z]
That would work for your first two rules (since it requires a letter at the beginning and end for the second rule, it automatically requires letters).
I'm not sure the third rule is possible using regexes.
