Name fixing / validation? - ruby-on-rails

Frequently, I have found, users enter very poorly formatted names when they register. I get all kinds of crazy formatting from Paypal IPN and other payment gateways even from all lower case to all caps to just flat out messed up.
One thing I do with this information is to send out emails and offer greetings, however I dislike the poorly formatted names. Has someone thought about this before and figured out a happy middle road solution? For example, I realize it would be poor form to simply correct spellings that are seemingly errors, but it would be wise to at least fix "what is reasonable." At the minimum that would be capitalization. Perhaps simply upcasing the first letters of each distinct "word" in the first and last name strings would be sufficient?
Or is there a a better method? Perhaps a database of common name capitalizations for things like "McBerry" and "van Buuren"? A gem or some such tool? Just kind of curious. Perhaps it is foolish to put this much thought into this topic, but I really like to be as courteous and professional as possible in my communications with users vs just using a poorly formatted name as is the usual.

The best you can hope to do is capitalize the first letter of their first/last/middle name:
"bob".capitalize => "Bob"
From Ruby:
capitalize → new_str click to toggle source
Returns a copy of str with the first character converted to uppercase and the remainder to lowercase. Note: case conversion is effective only in ASCII region.
"hello".capitalize #=> "Hello"
"HELLO".capitalize #=> "Hello"
"123ABC".capitalize #=> "123abc"
You can also use downcase to level everything out, then capitalize to make it "right".
For instance:
fName = "jIMMY"
lName = "sMITH"
fName.downcase
lName.downcase
fName.capitalize
lNmae.capitalize
puts fName + lName => Jimmy Smith
However, with names like VanBuuren, it might be a little harder.
Here is a link to Ruby strings which has some methods that might help you on your quest.
http://www.ruby-doc.org/core-2.0/String.html

Related

Delphi - create Title/Proper/Mixed Case for Strings

I have a list of approx 100,000 names I need to process. Some are business names, some are people names. Unfortunately, some are lower, some are upper, and some are mixed. I am looking for a routine to convert them to proper case. (Sometimes called Mixed or Title case). I realize I can just loop through the string and capitalize every character that starts a new word. That would be an incredibly simplistic approach. For businesses, short words should be lowercase (of, with, for, ...). For last names, if it starts with Mc, the 3rd letter should be capitalized (McDermot, McDonald, etc). Roman numerals should always be capitalized (John Smith II ), etc.
I have not been able to find any Delphi built in, or otherwise, routines. Surely this is out there. Where can I find this?
Thanks
As it was already said by others, making a fully automated routine for this is nearly impossible due to so many special variations. So leaving out the human interaction completely is almost impossible.
Now what you can do instead is to make this much easier for human to solve. How? Make a dictionary of all the name variations in Lowercase and present it to him.
Before presenting the names you can make sure that the first letter in any of the names is already capitalized.
Once all name correction has been made in dictionary you go and automatically replace all the names in original database.

Regexp for a name

I need to make sure people enter their first, middle and last names correctly for a form in Rails. So the first thought for a regular expression is:
\A[[:upper:]][[:alpha:]'-]+( [[:upper:]][[:alpha:]'-]*)*\z
That'll make sure every word in the name starts with an uppercase letter followed by a letter or hyphen or apostrophe.
My first question I guess doesn't have much to do with regular expressions, though I'm hoping there's a regular expression I can copy for this. Are letters, hyphens and apostrophes the only characters I should be checking in a name?
My second question is if it's important to make sure each name has at least 1 uppercase letter? So many people enter all lowercase names and I really want to avoid that, but is it sometimes legitimate?
Here's what I have so far that makes sure there's at least 1 uppercase letter somewhere in the name:
\A([[:alpha:]'-]+ )*[[:alpha:]'-]*[[:upper:]][[:alpha:]'-]*( [[:alpha:]'-]+)*\z
Isn't there a [:name:] bracket expression? :)
UPDATE: I added . and , to the characters allowed, surprised I didn't think of them originally. So many people must have to deal with this kind of regular expression! Nobody has any pre-made regular expressions for this sort of thing?
A good start would be to allow letters, marks, punctiation and whitespace. To allow for a given name like "María-Jose" and a last name like "van Rossum" (note the whitespace).
So that boils down to something like:
[\p{Letter}\p{Mark}\p{Punctuation}\p{Separator}]+
If you want to restrict that a bit you could have a look at classes like \p{Lowercase_Letter}, \p{Uppercase_Letter}, \p{Titlecase_Letter}, but there may be scripts that don't have casing. \p{Space_Separator} and \p{Dash_Punctuation} can narrow it down to names that I know. But names I don't...I don't know...
But before you start constructing your regex for "validating" a name. Please read this excellent piece on names by W3C. It will shake even your concepts of first, middle and last names.
For example:
In some cultures you are given a name (Björk, Osama) and an indication of who your father (or mother) was (Guðmundsdóttir, bin Mohammed). So the "first name" could be "Björk" but:
Björk wouldn’t normally expect to be called Ms. Guðmundsdóttir. Telephone directories in Iceland are sorted by given name.
But in other cultures, the first name is not given, but a family name. In "Zhāng Mànyù", "Zhāng" is the family name. And how to address her, would depend how well you know her, but again "Ms. Zhāng" would be strange.
The list of examples goes on and ends in a some 30+ links to Wikipedia for more examples.
The article does end with suggestions for field design and some pointers on what characters to allow:
Don't forget to allow people to use punctuation such as hyphens, apostrophes, etc. in names. Don't require names to be entered all in upper case – this can be difficult on a mobile device. Allow the user to enter a name with spaces , eg. to support prefixes and suffixes such as de in French, von in German, and Jnr/Jr in American names, and also because some people consider a space-separated sequence of characters to be a single name, eg. Rose Marie.
To answer your question about capital letters: in many areas of the world, names do not necessarily start with a capital letter. In Dutch for instance, you have surnames like "van der Vliet" where words like "van", "de", "den" and "der" are not capitalised. Additionally, you have special cases like "De fauw" and "Van pellicom" where an administrative error never got rectified, and the correct capitalisation is fairly illogical. Please do not make the mistake of rejecting such names.
I also know about town names in South Africa such as eThekwini, where the capital letter is not necessarily the first letter of the word. This could very well appear in surnames or given names as well.

How- XLST Transformation

Just wanted to ask on how to get the author names in the given xml sample below and put an attribut of eq="yes". EQ means Equal Contributors.
This is the XML.
<ArticleFootnote Type="Misc">
<Para>John Doe and Jane Doe are equal contributors.</Para>
</ArticleFootnote>
This should be the output in other form of XML.
<AuthorGroups>
<Authors eq="yes">John Doe</Authors>
<Authors eq="yes">Jane Doe</Authors>
</AuthorGroups>
Assuming that JOhn Doe and Jane Doe are already defined in the list of authors but after the transformation, author tag should have the attribute eq="yes". Please help as I don't know much writing in xlst.
Thanks in advance.
There's not really enough information here to give you a clear answer.
If you have a list of authors, you could use fn:match() on each author's name in turn, maybe after changing space to \s+ in the pattern.
I normally use Perl to do this sort of thing, though, being careful not to disrupt the tagging structure.
In any case you'll either need to process the text a word at a time, probably recursively to find the longest match in cases where one name is just "John" and another is "John Doe". Watch that you don't add markup to names you've already processed.
In the case that the text really always says exactly what you have there, but with different names, though, you could have a template to match ArticleFootnote/Para[contains(., 'are equal contrubutors')] and either use substring() and substring-before() or the XSLT 2 pattern matching.

regex for a full name

I've recently been receiving a lot of first name only entries in a form. While maybe I should have had 2 separate first and last name fields this always seemed to me a bit much. But I would like to try and get a full name which basically can only be determined by having at least one space.
I came up with this, but I'm wondering if someone has a better and possibly simpler solution?
/([a-zA-ZàáâäãåèéêëìíîïòóôöõøùúûüÿýñçčšžÀÁÂÄÃÅÈÉÊËÌÍÎÏÒÓÔÖÕØÙÚÛÜŸÝÑßÇŒÆČŠŽ∂ð,.'-]{2,}) ([a-zA-ZàáâäãåèéêëìíîïòóôöõøùúûüÿýñçčšžÀÁÂÄÃÅÈÉÊËÌÍÎÏÒÓÔÖÕØÙÚÛÜŸÝÑßÇŒÆČŠŽ∂ð,.'-]{2,})/
This is basically this /([a-zA-Z,.'-]) ([a-zA-Z,.'-])/ plus unicode support.
I'd first make sure that you really do need people to give you a last name. Is that a genuine requirement? If not, I'd skip it because it adds unnecessary complication and barriers to entry. If it really IS a requirement, it probably makes sense to have separate first and last name fields in your UI so that it's explicit.
The fact that you didn't do that to begin with suggests that you might not really need the last name as much as you think you do.
To answer your original question, this expression might give you what you're looking for without the guesswork:
/[\w]+([\s]+[\w]+){1}+/
It checks that the string contains at least 2 words separated by whitespace. Like Tim Pietzcker pointed out, validating the words themselves is prone to error.
In Ruby 1.9, you have access to Unicode properties (\p{L} is a Unicode letter). But trying to validate a name in any way (regex or not) is prone to failure because names are not what you think they are.
Your theory that "if there's a space, there must be a last name there" is incorrect, too - think of first and middle names...

Regex: Match a string containing numbers and letters but not a string of just numbers

Question
I would like to be able to use a single regex (if possible) to require that a string fits [A-Za-z0-9_] but doesn't allow:
Strings containing just numbers or/and symbols.
Strings starting or ending with symbols
Multiple symbols next to eachother
Valid
test_0123
t0e1s2t3
0123_test
te0_s1t23
t_t
Invalid
t__t
____
01230123
_0123
_test
_test123
test_
test123_
Reasons for the Rules
The purpose of this is to filter usernames for a website I'm working on. I've arrived at the rules for specific reasons.
Usernames with only numbers and/or symbols could cause problems with routing and database lookups. The route for /users/#{id} allows id to be either the user's id or user's name. So names and ids shouldn't be able to collide.
_test looks wierd and I don't believe it's valid subdomain i.e. _test.example.com
I don't like the look of t__t as a subdomain. i.e. t__t.example.com
This matches exactly what you want:
/\A(?!_)(?:[a-z0-9]_?)*[a-z](?:_?[a-z0-9])*(?<!_)\z/i
At least one alphabetic character (the [a-z] in the middle).
Does not begin or end with an underscore (the (?!_) and (?<!_) at the beginning and end).
May have any number of numbers, letters, or underscores before and after the alphabetic character, but every underscore must be separated by at least one number or letter (the rest).
Edit: In fact, you probably don't even need the lookahead/lookbehinds due to how the rest of the regex works - the first ?: parenthetical won't allow an underscore until after an alphanumeric, and the second ?: parenthetical won't allow an underscore unless it's before an alphanumeric:
/\A(?:[a-z0-9]_?)*[a-z](?:_?[a-z0-9])*\z/i
Should work fine.
I'm sure that you could put all this into one regular expression, but it won't be simple and I'm not sure why insist on it being one regex. Why not use multiple passes during validation? If the validation checks are done when users create a new account, there really isn't any reason to try to cram it into one regex. (That is, you will only be dealing with one item at a time, not hundreds or thousands or more. A few passes over a normal sized username should take very little time, I would think.)
First reject if the name doesn't contain at least one number; then reject if the name doesn't contain at least one letter; then check that the start and end are correct; etc. Each of those passes could be a simple to read and easy to maintain regular expression.
What about:
/^(?=[^_])([A-Za-z0-9]+_?)*[A-Za-z](_?[A-Za-z0-9]+)*$/
It doesn't use a back reference.
Edit:
Succeeds for all your test cases. Is ruby compatible.
This doesn't block "__", but it does get the rest:
([A-Za-z]|[0-9][0-9_]*)([A-Za-z0-9]|_[A-Za-z0-9])*
And here's the longer form that gets all your rules:
([A-Za-z]|([0-9]+(_[0-9]+)*([A-Za-z|_[A-Za-z])))([A-Za-z0-9]|_[A-Za-z0-9])*
dang, that's ugly. I'll agree with Telemachus, that you probably shouldn't do this with one regex, even though it's technically possible. regex is often a pain for maintenance.
The question asks for a single regexp, and implies that it should be a regexp that matches, which is fine, and answered by others. For interest, though, I note that these rules are rather easier to state directly as a regexp that should not match. I.e.:
x !~ /[^A-Za-z0-9_]|^_|_$|__|^\d+$/
no other characters than letters, numbers and _
can't start with a _
can't end with a _
can't have two _s in a row
can't be all digits
You can't use it this way in a Rails validates_format_of, but you could put it in a validate method for the class, and I think you'd have much better chance of still being able to make sense of what you meant, a month or a year from now.
Here you go:
^(([a-zA-Z]([^a-zA-Z0-9]?[a-zA-Z0-9])*)|([0-9]([^a-zA-Z0-9]?[a-zA-Z0-9])*[a-zA-Z]+([^a-zA-Z0-9]?[a-zA-Z0-9])*))$
If you want to restrict the symbols you want to accept, simply change all [^a-zA-Z0-9] with [] containing all allowed symbols
(?=.*[a-zA-Z].*)^[A-Za-z0-9](_?[A-Za-z0-9]+)*$
This one works.
Look ahead to make sure there's at least one letter in the string, then start consuming input. Every time there is an underscore, there must be a number or a letter before the next underscore.
/^(?![\d_]+$)[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*$/
Your question is essentially the same as this one, with the added requirement that at least one of the characters has to be a letter. The negative lookahead - (?![\d_]+$) - takes care of that part, and is much easier (both to read and write) than incorporating it into the basic regex as some others have tried to do.
[A-Za-z][A-Za-z0-9_]*[A-Za-z]
That would work for your first two rules (since it requires a letter at the beginning and end for the second rule, it automatically requires letters).
I'm not sure the third rule is possible using regexes.

Resources