I have a list of approx 100,000 names I need to process. Some are business names, some are people names. Unfortunately, some are lower, some are upper, and some are mixed. I am looking for a routine to convert them to proper case. (Sometimes called Mixed or Title case). I realize I can just loop through the string and capitalize every character that starts a new word. That would be an incredibly simplistic approach. For businesses, short words should be lowercase (of, with, for, ...). For last names, if it starts with Mc, the 3rd letter should be capitalized (McDermot, McDonald, etc). Roman numerals should always be capitalized (John Smith II ), etc.
I have not been able to find any Delphi built in, or otherwise, routines. Surely this is out there. Where can I find this?
Thanks
As it was already said by others, making a fully automated routine for this is nearly impossible due to so many special variations. So leaving out the human interaction completely is almost impossible.
Now what you can do instead is to make this much easier for human to solve. How? Make a dictionary of all the name variations in Lowercase and present it to him.
Before presenting the names you can make sure that the first letter in any of the names is already capitalized.
Once all name correction has been made in dictionary you go and automatically replace all the names in original database.
Related
So I am working on a artist classification project that utilizes hip hop lyrics from genius.com. The problem is these lyrics are user generated, so the same word can be spelled in various different ways, especially if it is slang which is a very common case in hip hop.
I looked into spell correction using hunspell/pyhunspell, but the problem with that is it doesn't fix slang misspellings. I technically could make a mini dictionary with a bunch of misspelled variations but that is effectively useless because there could be a dozen variations of the same word over my (growing) 6000 song corpus.
Any suggestions?
You could try to stem your words. More information on stemming here. This would help grouping together words with close spelling variations.
A popular stemming scheme is the Porter Stemmer, which implementation can be found in most NLP packages, eg. NLTK
I would discard, if possible, short words, or contracted words which somehow are too hard to automatically correct them (conditioned on checking that it won't affect your final result).
For longer words, you may want to use metrics like Levenshtein distance or Jaro similarity. The first one consists of the minimum number of additions, deletes or replaces to convert one candidate word into another. The second one, provides a similar result, between 0 and 1, and putting more emphasis in the last characters of a word.
If you have access to the correct version of your slang word, you could convert the closest candidates to the correct one. Of course, trying not to apply it to different correct words.
If you're working with Python, here some implementations are provided.
I need to make sure people enter their first, middle and last names correctly for a form in Rails. So the first thought for a regular expression is:
\A[[:upper:]][[:alpha:]'-]+( [[:upper:]][[:alpha:]'-]*)*\z
That'll make sure every word in the name starts with an uppercase letter followed by a letter or hyphen or apostrophe.
My first question I guess doesn't have much to do with regular expressions, though I'm hoping there's a regular expression I can copy for this. Are letters, hyphens and apostrophes the only characters I should be checking in a name?
My second question is if it's important to make sure each name has at least 1 uppercase letter? So many people enter all lowercase names and I really want to avoid that, but is it sometimes legitimate?
Here's what I have so far that makes sure there's at least 1 uppercase letter somewhere in the name:
\A([[:alpha:]'-]+ )*[[:alpha:]'-]*[[:upper:]][[:alpha:]'-]*( [[:alpha:]'-]+)*\z
Isn't there a [:name:] bracket expression? :)
UPDATE: I added . and , to the characters allowed, surprised I didn't think of them originally. So many people must have to deal with this kind of regular expression! Nobody has any pre-made regular expressions for this sort of thing?
A good start would be to allow letters, marks, punctiation and whitespace. To allow for a given name like "María-Jose" and a last name like "van Rossum" (note the whitespace).
So that boils down to something like:
[\p{Letter}\p{Mark}\p{Punctuation}\p{Separator}]+
If you want to restrict that a bit you could have a look at classes like \p{Lowercase_Letter}, \p{Uppercase_Letter}, \p{Titlecase_Letter}, but there may be scripts that don't have casing. \p{Space_Separator} and \p{Dash_Punctuation} can narrow it down to names that I know. But names I don't...I don't know...
But before you start constructing your regex for "validating" a name. Please read this excellent piece on names by W3C. It will shake even your concepts of first, middle and last names.
For example:
In some cultures you are given a name (Björk, Osama) and an indication of who your father (or mother) was (Guðmundsdóttir, bin Mohammed). So the "first name" could be "Björk" but:
Björk wouldn’t normally expect to be called Ms. Guðmundsdóttir. Telephone directories in Iceland are sorted by given name.
But in other cultures, the first name is not given, but a family name. In "Zhāng Mànyù", "Zhāng" is the family name. And how to address her, would depend how well you know her, but again "Ms. Zhāng" would be strange.
The list of examples goes on and ends in a some 30+ links to Wikipedia for more examples.
The article does end with suggestions for field design and some pointers on what characters to allow:
Don't forget to allow people to use punctuation such as hyphens, apostrophes, etc. in names. Don't require names to be entered all in upper case – this can be difficult on a mobile device. Allow the user to enter a name with spaces , eg. to support prefixes and suffixes such as de in French, von in German, and Jnr/Jr in American names, and also because some people consider a space-separated sequence of characters to be a single name, eg. Rose Marie.
To answer your question about capital letters: in many areas of the world, names do not necessarily start with a capital letter. In Dutch for instance, you have surnames like "van der Vliet" where words like "van", "de", "den" and "der" are not capitalised. Additionally, you have special cases like "De fauw" and "Van pellicom" where an administrative error never got rectified, and the correct capitalisation is fairly illogical. Please do not make the mistake of rejecting such names.
I also know about town names in South Africa such as eThekwini, where the capital letter is not necessarily the first letter of the word. This could very well appear in surnames or given names as well.
Given a phrase that is dynamically constructed with portions present or removed based on parameters, what are some possible solutions for supporting localization? For example, consider the following two phrases with bold parts that represent dynamically inserted portions:
The dog is spotted, has a doghouse and is chasing a ball.
The dog is white, and is running in circles.
For English, this can be solved by simply concatenating the phrase portions or perhaps having a few token-filled strings in a resource file that can be selected based on parameters. But these solutions won't work or get ugly quickly once you need to localize for other languages or have more parameters. In the example above, assuming that the dog appearance is the only portion always present, a localized resource implementation might consist of the following resource strings:
AppearanceOnly: The dog is %appearance%.
ActivityOnly: The dog is %appearance% and is %activity%.
AssessoryOnly: The dog is %appearance% and has %accessory%.
AccessoryActivity: The dog is %appearance%, has %accessory% and is %activity%.
While this works, the required number of strings grows exponentially depending upon the number of parameters.
Been searching far and wide for best practices that might help me with this challenge. The only solution I have found is to simply reword the phrase—but you lose the natural sentence structure, which I really don't want to do:
Dog: spotted, doghouse, chasing ball
Suggestions, links, thoughts, examples, or "You're crazy, just reword it!" feedback is welcome :) Thanks!
The best approach is probably to divide the sentence to separate sentences, like “The dog is spotted. The dog has a doghouse. The dog is chasing a ball.” This may look boring, but if you would replace all occurrences of “the dog” except the first one, you have a serious pronoun problem. In many languages, the pronoun to be used would depend on the noun it refers to. (Even in English, it is not quite clear whether a dog is he, she, or it.)
The reason for separation is that different languages have different verb systems. For example, in Russian, you cannot really combine the three sentences into one sentence that has three verbs sharing a subject. (In Russian, you don’t use the verb “to be” in present tense – instead, you would just say the equivalent of “Dog – spotted”, and there is no verb corresponding to “to have” – instead, you use the equivalent of “at dog doghouse”. Finnish is similar with respect to “to have”. Such issues are sometimes handled, in “forced” localizations, by using a word that corresponds to “to possess” or “to own”, but the result is odd-looking, to put it mildly.)
Moreover, languages have different natural orders for subject, verb, and object. Your initial approach implicitly postulates a SVO order. You should not assume that the normal, unmarked word order always starts with the subject. Instead of using sentence patterns like "%subject% %copula% %appearance% (where %copula% is “is”, “are”, or “am” in English), you would need to call a function with two parameters, subject and appearance, returning a sentence that has a language-dependent copula, or no copula, and that has a word order determined by the rules of the language. Yes, it gets complicated; localization of generated statements gets rather complicated as soon as you deal with anything but structurally very similar languages.
I'm using a lookup table for optimizing an algorithm that works on single characters. Currently I'm adding a..z, A..Z, 0..9 to the lookup table. This works fine in european countries, but in asian countries it doesn't make much sense.
My idea was that I could perhaps use the characters in the windows default code page as an alphabet for the lookup table.
Pseudocode:
for Ch in DefaultCodePage.Characters do
LookupTable.Add (Ch, ComputeValue (Ch));
What do you think and how could this be achieved? Any alternative suggestions?
As you mentioned, it does not make much sense for different scripts. It may only make some sense for alphabet-based languages.
BTW. A-Z is not enough for most of European languages.
I don't quite know what you are doing and what you need this look-up table for but it seems that what you are looking for are Index Characters. You could find such information in CLDR – look for indexCharacters. The resources for various languages are available here.
The only problem you'll face that in fact for some languages Index Characters tend to be Latin based. That is just because these languages do not actually have them... In that case you might want to use so called Exemplar Characters instead but please be warned that it might be just not enough for some use cases.
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
There are the standard A-Z, a-z characters, but also there are hyphens, em dashes, quotes, etc.
Plus, there are all of the international characters, like umlauts, etc.
So, for an English-based system, what's the complete set? What about sets for other languages? What about UTF8, UTF16, etc?
Bonus question: How many name fields are needed, and what are their maximum lengths?
EDIT: There are definitely two different types of characters involved in people's names, those that are there as part of the context, and those that are there for structural reasons. I don't want to limit or interfere with the context characters, but I do need to deal with the structural ones.
For example, I had a name come in that was separated by an em dash, but it was hard to distinguish that from the minus character. To make the system easier for searching, I want to take all five different types of dashes, and map them onto one unique character (minus), that way the searcher doesn't need to know specifically which symbol was initially entered.
The problem exists for dashes, probably quotes as well, but also how many other symbols?
There's good article by the W3C called Personal names around the world that explains the problems (and possible solutions) pretty well (it was originally a two-part blog post by Richard Ishida: part 1 and part 2)
Personally I'd say: support every printable Unicode-Character and to be safe provide just a single field "name" that contains the full, formatted name. This way you can store pretty much every form of name. You might need a more structured storage, but then don't expect to be able to store every single combination in a structured form, as there are simply too many different ones.
Whitelisting characters that could appear in a person's name is the wrong way to go, if you ask me. Sure, [A-Za-z] is a fair starting point, but, as you said, you get problems with "European" names. So you map all the umlauts, circumflexes and those. What about Chinese names? Japanese? Indian? Hebrew? You're entering a battle against wind turbines.
If you absolutely must check the validity of someone's name, I'd suggest doing a modest blacklist of certain characters. Braces, mathematical characters, some punctuation and such might be safe to ignore. But I'd be cautious, if I were you.
It might be best to just accept whatever comes in. UTF-16 should be today's overkill character set, that should be adequate for some years to come.
Edit: As for your question about name length and amount of names. If you really want people to write their real and complete names, I guess the only foolproof answer to both of those questions would be "infinite". Not being able to whip out any real examples for human beings, but surely there are analogous examples for humans as the native name for the city of Bangkok.
I don't think there's a definitive answer. After all, some people have names that can't even be expressed in UTF-16...
There are some odd people out there, who will give their kids the craziest of names, including putting in weird punctuation, accents that don't exist in their own language, etc.
However, you can place arbitrary restrictions on your database. If you want to you can insist on 7 bit ASCII names. It's slightly rude to users, but they'll live with it. It certainly makes searching easier.
My colleague's daughter is named Amélie. But even some (not all!) official British government web sites ("Please enter the name exactly as shown on the birth certificate") won't accept the unicode, so he has to use 'Amelie' instead.
Any character that can be represented by any multiple of eight bits (greater than zero) is a possible character for a person's name. Lengths of both names and encodings are arbitrary, so no upper bound should be considered.
Just make sure you sanitize your database inputs so little Bobby Drop-tables doesn't get ya.
On the issue of name fields, the WRONG answer is first name, middle initial, last name, etc. for many reasons.
Many people are known by their middle name, and formally use a first initial, middle name, last name format.
In some cultures, the surname is the first name, and the given name is the last name.
Multiple first and/or middle given names is getting more common. As #Dour High Arch points out, the other extreme is people with only one word in their name.
In an object-oriented database, you would store a Name object with methods to return a directory-style or signature-style name; and the backing store would contain whatever data was necessary to support those methods.
I haven't yet seen a relational database model that improves on the model of two variable-length strings for directory-style and signature-style names.
I'm making software for driving schools in the USA, so to me what matters most what the state DMV's accept as a proper name on a driver's license. In my case, it would cause problems to allow names beyond what the DMV allows, even if such names were legal because the same name must later be used for a driver's license.
From StackOverflow, I still hadn't confirmed the answer I needed. And I happen to know that in my state (Calif) they're using AS400's with software probably written in COBOL, and to the best of my knowledge, those only support an 8-bit character set. (Is it EBCDIC?) Anyway... Ugh.
So, I called the California DMV... Sure enough, their system allows A-Z and spaces and absolutely nothing else. Not even hyphens are allowed -- Hyphens are replaced with spaces. In fact, apparently just to be difficult, they only use capitals. And names such as "O'Malley" must be replaced with OMALLEY.
Leave it to government. I must say I'm thrilled not to be a developer working for DMV. (Although I could really use that kind of salary.)
It really depends on what the app is supposed to be used for.
Sure, in theory it's great if you allow every script on god's green earth to be used, but if the DB is also used by support staff, are they going to be able to handle names in Japanese, Hebrew and Thai script? Can you printer, if it's used to print postage labels?
You might add an extra field "Latin Transcription", but IMO it's really OK to restrict it to ISO-8859-1 characters - People who don't use Latin characters are by now so used to having to use a transcription that they don't mind it anymore, unless they're hardcore nationalists.
UTF-8 should be good enough, as far as name fields, you'll want at minimum a first name and last.
Depending on the complexity of your name structure I could see:
First Name
Middle Initial/Middle Name
Last Name
Suffix (Jr. Sr. II, III, IV, etc.)
Prefix (Mr., Mrs., Ms., etc.)
What do you do when you have "The Artist Formerly Known as Prince". That symbol he used is not a character in the unicode set (AFAIK).
It's some levity, but at the same time, names are a rather broad concept that doesn't lend itself well to a structured format. In this case, something free-form might be most appropriate.