Parsing a full name into its constituents

We need to develop a back-end application that can parse a full name into:
Prefix (Dr. Mr. Ms. etc)
First Name
Last Name
Middle Name
etc
The challenge is that it has to support names from multiple countries and languages. One assumption we have is that we will always get a country and language along with the full name as input.
The full name may come in any format. For the same country/language combination, it may arrive as first name then last name, or the reverse. A comma will not be part of the full name.
Is it feasible? We are also open to any commercially available software.

I think this is impossible. Consider Ralph Vaughan Williams. His family name is "Vaughan Williams" and his first name is "Ralph". Contrast this with Charles Villiers Stanford, whose family name is "Stanford", with first name "Charles" and middle name "Villiers".
Both are English-speaking composers from England, so country and language information is not sufficient to establish the correct parsing logic.

Since the OP was open to any commercially available offering...
The "IBM InfoSphere Global Name Analytics" product appears to be a commercial solution satisfying the original request for parsing a [free-form, unstructured] personal name [full name], apparently with a degree of certainty with regard to resolving some of the name ambiguity issues alluded to in other responses. Note: I have no personal experience with the product nor any association with it; I merely encountered this discussion and the following reference links while re-investigating effectively the same concern as described by the OP. HTH.
A general product documentation link:
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gna_con_gnaoverview.html
Refer to the "Parsing names using NameParser" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_np_con_parsingnamesusingnameparser.html
The NameParser is a component API for the product per
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturecapis.html
Refer to the "Parsing names using IBM NameWorks" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_parsingnamesusingnameworks.html
"IBM NameWorks combines the individual IBM InfoSphere Global Name Recognition components into a single, unified, easy-to-use application programming interface (API), and also extends this functionality to Java applications and as a Web service"
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturenwapis.html
To clarify why I think this answers the question and ameliorates some of the previously mentioned difficulties in accomplishing the task: if I understood correctly what I read, the APIs use the "NameHunter Server" to search the "IBM InfoSphere Global Name Data Archive (NDA)", which is described as "a collection of nearly one billion names from around the world, along with gender and country of association for each name. This large repository of name information powers the algorithms and rules that IBM InfoSphere Global Name Recognition products use to categorize, classify, parse, genderize, and match names."
FWIW, I also ran across a "Name Parser" which uses a database of ~140K names, as noted at:
http://www.melissadata.com/dqt/websmart-web-services.htm

The only reasonable approach is to avoid having to do so in the first place. The most obvious (and common) way to do that is to have the user enter the title, first/given name, last/family name, suffix, etc., separately from each other, rather than attempting to parse them out of a single string.

Ask yourself: do you really need the different parts of a name? Parsing names is inherently un-doable, since different cultures use different conventions (e.g. "middle name" is a typical USA-ism) and some small percentage of names will always be treated wrongly.
It is much preferable to treat a name as an "atomic" not-splittable entity.

Here are two free PHP name parsing libraries for those on a budget:
https://code.google.com/p/php-name-parser/
http://jasonpriem.org/human-name-parse/
And here is a JavaScript library in the Node package manager:
https://npmjs.org/package/name-parser

I wrote a simple human name parser in javascript as an npm module:
https://www.npmjs.org/package/humanparser
humanparser
Parse a human name string into salutation, first name, middle name, last name, suffix.
Install
npm install humanparser
Usage
var human = require('humanparser');
var fullName = 'Mr. William R. Jenkins, III';
var attrs = human.parseName(fullName);
console.log(attrs);
// produces the following output:
// { saluation: 'Mr.',
//   firstName: 'William',
//   suffix: 'III',
//   lastName: 'Jenkins',
//   middleName: 'R.',
//   fullName: 'Mr. William R. Jenkins, III' }

A basic algorithm could do the following:
First, check whether the incoming string starts with a title such as Mrs, comparing against a fixed list of titles, and remove it if it does.
If exactly one space remains, assume the first word is the first name and the second word is the surname (which will be incorrect at times).
Going beyond that would be a lot of work; see "How to parse full names" to identify avenues for improvement, and see the involved IBM docs for further implementation clues.
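The two steps above can be sketched roughly as follows. This is only a minimal illustration: the title list and the function name `parseBasicName` are made up for the example, and anything other than a two-word remainder is deliberately returned unparsed.

```javascript
// Hypothetical fixed list of titles; a real list would be much longer.
const TITLES = ['mr', 'mrs', 'ms', 'miss', 'dr', 'prof'];

function parseBasicName(fullName) {
  const words = fullName.trim().split(/\s+/).filter(Boolean);
  let prefix = null;

  // Step 1: strip a leading title, compared without its trailing period.
  if (words.length > 0 &&
      TITLES.includes(words[0].replace(/\.$/, '').toLowerCase())) {
    prefix = words.shift();
  }

  // Step 2: exactly two words left (one space), so guess "first last".
  if (words.length === 2) {
    return { prefix, firstName: words[0], lastName: words[1], unparsed: null };
  }

  // Anything else is beyond this sketch; hand it back unparsed.
  return { prefix, firstName: null, lastName: null, unparsed: words.join(' ') };
}

console.log(parseBasicName('Mrs Jane Smith'));
console.log(parseBasicName('Ralph Vaughan Williams'));
```

Note that the second call is left unparsed on purpose, since the sketch has no way to know that "Vaughan Williams" is a single family name.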

"Ashton Jordan" "Jordan Ashton" -- you can't tell which is the surname and which is the given name.
Also people in South India apparently don't have a surname. The same with Sherpas in the Himalayas.
But say you have a huge list of all surnames (which are never used as given names); then maybe you can use that to identify the other parts of the name (Salutations/Given/Middle/Jr/Sr/I/II/...). And if there is ambiguity, your name parser could ask for human input.
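A minimal sketch of that idea, assuming a surname list exists at all. The `KNOWN_SURNAMES` set and the `guessSurname` function are hypothetical, and the two entries are placeholders only.

```javascript
// Hypothetical list of words that are only ever surnames.
const KNOWN_SURNAMES = new Set(['miller', 'stanford']);

function guessSurname(words) {
  const hits = words.filter(w => KNOWN_SURNAMES.has(w.toLowerCase()));
  // Exactly one hit: accept it. Zero or several hits: ask a human.
  if (hits.length === 1) {
    return { surname: hits[0], needsHuman: false };
  }
  return { surname: null, needsHuman: true };
}

console.log(guessSurname(['John', 'Miller']));
console.log(guessSurname(['Ashton', 'Jordan']));
```

The "Ashton Jordan" case falls through to `needsHuman: true`, since neither word is in the list.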

As others have explained, the problem is not solvable. The best approach I can think of to storing names is storing the full name, followed by the start (and potentially also ending) offsets into a "primary collating subfield" which the person entering the name could have indicated by highlighting it or such. For example
John Robert Miller, Jr.
where the boldface indicates what was marked as the "primary collating subfield". This range would then be moved to the beginning of the string when generating the collating key.
Of course, this approach alone may not be sufficient if you also want to support titles (and ignore them for collation purposes)...
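As a rough sketch of the offset idea: the function name `collatingKey` is made up, and marking "Miller" as the primary collating subfield is an assumption for the example.

```javascript
// Build a collating key by moving the marked range [start, end) to the front.
function collatingKey(fullName, start, end) {
  const primary = fullName.slice(start, end);
  const rest = fullName.slice(0, start) + fullName.slice(end);
  return (primary + ' ' + rest)
    .replace(/\s+/g, ' ')   // collapse doubled spaces left by the cut
    .replace(/\s+,/g, ',')  // avoid a stray space before a comma
    .trim();
}

const name = 'John Robert Miller, Jr.';
const start = name.indexOf('Miller');   // assumed to be the range the user marked
const end = start + 'Miller'.length;
console.log(collatingKey(name, start, end));
```

The stored record stays the unmodified full name plus two integers, so the display form is never altered.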

Related

Text recommendation based on keywords

I need some advice on the following problem.
I'm given a set of weighted keywords (by percentage) and need to find a text in a database that best matches those keywords. I will give an example.
I'm presented with these keywords
Sun(90%)
National Park(85%; some keywords contain 2 words)
Landmark(60%)
Now let's say my database contains 3 text entries, e.g.:
Going-to-the-Sun Road is a scenic mountain road in the Rocky Mountains of the western United States, in Glacier National Park in Montana.
Everybody has a little bit of the sun and moon in them. Everybody has a little bit of man, woman, and animal in them.
A hybrid car is one that uses more than one means of propulsion - that means combining a petrol or diesel engine with an electric motor.
Obviously the first text is the one that best matches the given set of keywords, so this is what I want to recommend to the user. Next comes the second text, which relates somewhat to the "sun" keyword and could be an acceptable choice too.
The 3rd text is totally irrelevant and should only be recommended as a last resort when everything else fails.
I'm totally new to this kind of thing, so I need some advice as to which technologies/algorithms I should use. It seems like there is some machine learning (NLP) involved, or some kind of fuzzy logic. I'm not really sure.
You need to use a combination of query term boosting and synonyms.
Look into "Is there a way to do fuzzy string matching for words on string?"
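To make the boosting idea concrete, here is a deliberately naive sketch, not what a real search engine does. Each keyword's percentage is used as its boost, and a text scores the sum of the boosts of the keywords it contains. The function names are invented, and plain substring matching is an assumption (it would also match "sun" inside "Sunday"); synonyms are not handled.

```javascript
// Score one text: add the weight of every keyword found in it.
function scoreText(text, keywords) {
  const haystack = text.toLowerCase();
  let score = 0;
  for (const { term, weight } of keywords) {
    if (haystack.includes(term.toLowerCase())) {
      score += weight;
    }
  }
  return score;
}

// Rank all texts, best match first.
function rankTexts(texts, keywords) {
  return texts
    .map(text => ({ text, score: scoreText(text, keywords) }))
    .sort((a, b) => b.score - a.score);
}

const keywords = [
  { term: 'Sun', weight: 90 },
  { term: 'National Park', weight: 85 },
  { term: 'Landmark', weight: 60 },
];
const texts = [
  'A hybrid car is one that uses more than one means of propulsion.',
  'Everybody has a little bit of the sun and moon in them.',
  'Going-to-the-Sun Road is a scenic mountain road in Glacier National Park in Montana.',
];
console.log(rankTexts(texts, keywords).map(r => r.score));
```

With the three example texts this puts the Glacier National Park text first, the "sun and moon" text second, and the hybrid car text last.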

HL7 PID segment

I have trouble understanding part of HL7, specifically the PID segment.
If a patient has two different names, how can we build PID-5 using both names?
Example
the previous name
Han John Burke
the current name
Han Robert Mat
Any ideas?
PID-5 is a repeating field:
|name-1^components~name-2^components|
If a system doesn't support repeating fields, it doesn't support multiple names.
You may use the repetition character to separate the two names within PID.5, but ultimately it depends on what the system you're integrating with accepts. I found a standard HL7 spec from one EMR vendor that states that if the HL7 version is greater than 2.2, PID.5.7 will populate with L, and if there are multiple other names, PID.5.7 will populate with an M. The M may be a required flag to show that there is an additional name. For HL7 versions less than 2.2, PID.5.7 may not be used. You may also need to add carets as placeholders up to the 7th component for it to work.
I tested this in our test system and the result is below. You can see the M was used multiple times because there were multiple other versions of this person's name. Hope this helps! Thanks!
|TEST^FAITH^^^^^M~test^hello^^^^^M~maiden^faith^^^^^M|
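A small sketch of how such a repeating PID-5 value could be assembled. The function names are made up, and the sketch does not escape delimiter characters inside name parts, which a real implementation would need to do. The component positions follow the XPN layout (family, given, middle, suffix, prefix, degree, name type code).

```javascript
// Build one XPN repetition: family^given^middle^suffix^prefix^degree^typeCode
function buildXpn({ family = '', given = '', middle = '', typeCode = '' }) {
  const components = [family, given, middle, '', '', '', typeCode];
  // Trailing empty components can be dropped.
  return components.join('^').replace(/\^+$/, '');
}

// Join several names with the repetition separator "~".
function buildPid5(names) {
  return names.map(buildXpn).join('~');
}

console.log(buildPid5([
  { family: 'TEST', given: 'FAITH', typeCode: 'M' },
  { family: 'test', given: 'hello', typeCode: 'M' },
  { family: 'maiden', given: 'faith', typeCode: 'M' },
]));
```

Fed the three names from the test-system result above, this reproduces the same field content between the pipes.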

Parsing six figure Ordnance Survey Grid Reference input

An extensive search has produced no answer to the question, "Is there a class or function that parses input for soundness relating to UK Ordnance Survey Grid References".
The UK is mapped by the Ordnance Survey, which produces detailed maps of the United Kingdom with many types of referencing. One of these is the commonly used six-figure Grid Reference; we are at SO896804.
We already use a postcode (zip) checker to make sure that the information entered into the postcode field is sound, but we can't find the same for the OS Grid Reference.
Does such a Grid Reference function exist, or do we down tools and write one?
Thank you.
Since you tagged this "parsing", rather than using some GIS-like tag, I'd say that a reasonably valid OS grid reference corresponds to the regular expression:
(H[PTUWXYZ]|N[ABCDFGHJKLMNORQSTUWXYZ]|OV|S[CDEHJKMNOPRSTUVWXYZ]|T[AFGLMQRV])([0-9]{6})
If you were prepared to accept 4-digit 100ha blocks as well as the six-digit 1ha blocks, you could replace the second parenthesized expression with ([0-9]{6}|[0-9]{4}).
Of course, some of the two-letter blocks are almost completely marine. You could almost certainly ignore OV, for example. NW contains a little bit of Dumfries & Galloway (Portpatrick, NW995545), and a much larger but irrelevant part of Northern Ireland.
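For what it's worth, a small sketch of wrapping that regular expression in a checker. The function name `parseGridRef` is made up; anchoring the pattern, stripping whitespace, and upper-casing the input are assumptions about how lenient you want to be.

```javascript
// The six-figure pattern from above, anchored to the whole input.
const OS_GRID_REF = /^(H[PTUWXYZ]|N[ABCDFGHJKLMNORQSTUWXYZ]|OV|S[CDEHJKMNOPRSTUVWXYZ]|T[AFGLMQRV])([0-9]{6})$/;

function parseGridRef(input) {
  // Tolerate spaces and lower case, e.g. "so 896 804".
  const m = OS_GRID_REF.exec(input.replace(/\s+/g, '').toUpperCase());
  if (!m) {
    return null;
  }
  // In a six-figure reference the first three digits are the easting
  // and the last three the northing.
  return { square: m[1], easting: m[2].slice(0, 3), northing: m[2].slice(3) };
}

console.log(parseGridRef('SO896804'));
console.log(parseGridRef('so 896 804'));
console.log(parseGridRef('ZZ123456'));
```

Returning `null` for anything that does not match keeps it usable as a plain validity check as well.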

iOS - Check pasteboard for valid Mailing Address

I am looking for some guidance for how I could check the pasteboard in iOS for a valid mailing address.
If someone pastes
1234 Apple Street
New York, NY 10011
It should parse each part of the string to fill in Address, City, State and Zip. It could be any address, and it would be ideal if it could be found inside a longer string.
For example
Meet me at 1234 Apple Street New York, NY 10011 See you there!
It should still parse the correct Address, City, State and Zip.
Any help would be much appreciated!
-Wes
I was a developer at SmartyStreets. We were kind of crazy about street addresses, and street addresses drove me crazy (especially parsing them). It's a two-way street. (Am I done with the street puns?)
First, let's talk about the case where the address is all by itself, because that's easier, albeit still difficult...
Please reference this other question and answer about the very same thing. I also strongly encourage you follow the links to related questions in both the question and the answer. Parsing addresses is a can of worms, but it's not impossible. It's just really hard to do it reliably.
Notice in the answer to that question how many different formats valid addresses can appear in. What guarantees do you have that the user will type it in any of those? And that's only a few. There are others. Consider military, PO box, rural route, and other "special" addresses that don't adhere to the typical format. What about addresses that have a two-or-three-word city name? What about addresses that use a grid system like 100 N 500 E, or secondary numbers like suite, apartment, floor, etc? What about addresses with "1/2", hyphens (as a required punctuation), etc? Addresses missing zip codes or city/state?
All of these and more could be valid. And that's only for US addresses.
If all your addresses, or even most of them (which isn't the case), came in the form like you proposed above, as an example:
[Primary Number] [Street Name] [Any of these street suffixes]
[City Name Followed by a Comma], [State Abbreviation] [5-digit ZIP code]
Then this would be quite easy. Wouldn't that be nice?
You could try to write a regular expression like this guy or that guy, but that only works if addresses are a regular language. They're not regular, and regular expressions are not the answer.
There are a few services which can do this for you because they have a master list (kind of), and the software has to meet rigorous certification standards.
Obviously, since I work at SmartyStreets, I'm prone to suggest starting your search for an answer there. You can try some freeform addresses on the homepage (just fill out the "Street" field). But be aware of a few things that will probably always be an issue. LiveAddress API will be able to parse street addresses for you, most of the time. Shop around, but this should give you an idea.
Now your second question: extract a street address from a string of text. This has been extensively covered elsewhere on S.O. and the interwebs, so I won't go into a lot of detail. Basically, to do this reliably, you'll probably need some Natural Language Processing and human interaction to confirm or correct the best guess.
Don't ever assume these things about un-standardized addresses:
Starts with a number
Ends with a number
Everything between the two numbers is an address
Has a ZIP code present
No more than 2 numbers will be in an address
It's unambiguous
It exists
A street suffix will always be present
It's spelled correctly
...etc.
Again, refer to some other linked posts about this issue. You can make guesses, but always always always have a human confirm the guess if you do that. (Some Mac apps do this. If they detect an address, it will get highlighted, and you can add that address to your contacts. Unfortunately I've seen false positives a lot, and it also misses them a lot.)
Good luck!
I also work at SmartyStreets, and since I'm not a developer I'm not bound by any constraints such as "it can't be done" or "there's no way to do it reliably". In fact the ideas that I come up with may not even always be possible, but, I'm a problem-solver, a solution-finder, and this particular problem absolutely has a solution.
You'll need the following: a little regex, knowledge of a scripting language (python, php, whatever you prefer) and access to an address validation tool (this is required so that you know when you get it right).
So, let's start with the example sentence:
Meet me at 1234 Apple Street New York, NY 10011 See you there!
We can be sure that every address has a beginning and an end. (you can take that to the bank!)
So, if you run a regular expression that looks for the beginning of the address within the string you can eliminate everything before the address begins. Here's a regex that will do just that:
(^(.*(?=p\.?o\.? box|h\.?c\.?r\.? |c\.?m\.?r\.?)|^[^0-9]+))
This will give you back the following:
1234 Apple Street New York, NY 10011 See you there!
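In JavaScript, that step might look like the sketch below. The function name is made up, and the case-insensitive `i` flag is an assumption (the regex above is shown without flags, but "PO Box" would otherwise not match).

```javascript
// The regex from above; it matches everything that precedes the address.
const ADDRESS_START = /(^(.*(?=p\.?o\.? box|h\.?c\.?r\.? |c\.?m\.?r\.?)|^[^0-9]+))/i;

function stripBeforeAddress(text) {
  // Removing the match leaves the string starting at the address.
  return text.replace(ADDRESS_START, '');
}

console.log(stripBeforeAddress('Meet me at 1234 Apple Street New York, NY 10011 See you there!'));
console.log(stripBeforeAddress('Send it to PO Box 123, Anytown'));
```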
Now, you're halfway there, but you'll need to loop through the remaining string. Another assumption that you can certainly make is that an address will never be longer than 328 characters (I made up that number, but you get the picture. An address has to have an end as well, and you can shorten the string by determining the max acceptable USPS address length.)
You're going to loop through the address string until you get a valid address out of it. To do this, start at the beginning and move one word to the right with each additional permutation. This is where the address validation service comes in handy, because you have no idea where the address ends, and that's what you need to know. So, each permutation you generate from the string (remember, you're starting from the left side) will be sent for validation. Since no valid address can have fewer than two words, you'll start there. Here are the permutations from the example address as well as the validation results (I'm trying each address by entering it in the address line of the address search box on smartystreets.com):
1234 Apple ==> fail
1234 Apple Street ==> fail
1234 Apple Street New ==> fail
1234 Apple Street New York ==> fail
1234 Apple Street New York, NY ==> Bingo, valid address match. No need to keep iterating.
Now, obviously this is not a valid address but you can try the same thing with a real address and you'll get the same results. Obviously this isn't the most sophisticated method to extract a valid address from a string but it certainly works. And, since SmartyStreets allows you to send up to 100 addresses per query, you could permute the address string up to 99 times and get the results back in under 300ms. This won't work with every address, as you'll certainly find out, but it can very easily handle a large majority of them, regardless of how obscured the address is within the text string.
So, we started with this meet me at 1234 Apple Street New York, NY 10011 See you there! and within less than half a second came up with this 1234 Apple Street New York, NY 10011-1000.
Pretty cool huh? It even sounds really easy coming from a non-programmer.
Let's try it with a real address:
Meet me at 4219 jon young orlando fl 32839 See you there!
Apply regex and you get:
4219 jon young orlando fl 32839 See you there!
Permute, iterate, validate:
4219 jon ==> fail
4219 jon young ==> fail
4219 jon young orlando ==> fail
4219 jon young orlando fl ==> Bingo, valid address match.
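The permute-and-validate loop could be sketched like this. Everything here is illustrative: the function names are invented, and `isValid` stands in for a call to whatever address validation service you use (a real one would be asynchronous and could take the candidates in one batch). The stub validator in the usage lines is purely a placeholder.

```javascript
// Growing prefixes of the text, starting at two words.
function candidatePrefixes(text, minWords = 2, maxWords = 100) {
  const words = text.trim().split(/\s+/);
  const out = [];
  for (let n = minWords; n <= Math.min(words.length, maxWords); n++) {
    out.push(words.slice(0, n).join(' '));
  }
  return out;
}

// Return the shortest prefix the validator accepts, or null if none is.
function firstValidPrefix(text, isValid) {
  for (const candidate of candidatePrefixes(text)) {
    if (isValid(candidate)) {
      return candidate;
    }
  }
  return null;
}

// Placeholder validator: pretends anything ending in ", NY" validates.
const stubValidator = candidate => /, NY$/.test(candidate);
console.log(firstValidPrefix(
  '1234 Apple Street New York, NY 10011 See you there!', stubValidator));
```

Capping the number of candidates (`maxWords`) plays the same role as the maximum address length mentioned earlier.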

What are all of the allowable characters for people's names? [closed]

There are the standard A-Z, a-z characters, but also there are hyphens, em dashes, quotes, etc.
Plus, there are all of the international characters, like umlauts, etc.
So, for an English-based system, what's the complete set? What about sets for other languages? What about UTF8, UTF16, etc?
Bonus question: How many name fields are needed, and what are their maximum lengths?
EDIT: There are definitely two different types of characters involved in people's names, those that are there as part of the context, and those that are there for structural reasons. I don't want to limit or interfere with the context characters, but I do need to deal with the structural ones.
For example, I had a name come in that was separated by an em dash, but it was hard to distinguish that from the minus character. To make the system easier for searching, I want to take all five different types of dashes, and map them onto one unique character (minus), that way the searcher doesn't need to know specifically which symbol was initially entered.
The problem exists for dashes, probably quotes as well, but also how many other symbols?
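For illustration, the dash mapping could look like the sketch below. Which code points count as "dashes" is an assumption here: the Unicode hyphen through horizontal bar (U+2010 to U+2015) plus the minus sign (U+2212), all folded to the ASCII hyphen-minus.

```javascript
// Dash-like characters to fold: U+2010..U+2015 and U+2212 (minus sign).
const DASHES = /[\u2010-\u2015\u2212]/g;

function normalizeDashes(name) {
  // Map every dash variant onto the plain ASCII hyphen-minus.
  return name.replace(DASHES, '-');
}

console.log(normalizeDashes('Anne\u2014Marie'));  // em dash in the input
console.log(normalizeDashes('Jean\u2013Luc'));    // en dash in the input
```

Storing the original string alongside the normalized one would keep the name as entered while letting searches run against a single dash character.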
There's a good article by the W3C called Personal names around the world that explains the problems (and possible solutions) pretty well (it was originally a two-part blog post by Richard Ishida: part 1 and part 2).
Personally I'd say: support every printable Unicode character and, to be safe, provide just a single field "name" that contains the full, formatted name. This way you can store pretty much every form of name. You might need a more structured storage, but then don't expect to be able to store every single combination in a structured form, as there are simply too many different ones.
Whitelisting characters that could appear in a person's name is the wrong way to go, if you ask me. Sure, [A-Za-z] is a fair starting point, but, as you said, you get problems with "European" names. So you map all the umlauts, circumflexes and those. What about Chinese names? Japanese? Indian? Hebrew? You're entering a battle against wind turbines.
If you absolutely must check the validity of someone's name, I'd suggest doing a modest blacklist of certain characters. Braces, mathematical characters, some punctuation and such might be safe to ignore. But I'd be cautious, if I were you.
It might be best to just accept whatever comes in. UTF-16 should be today's overkill character set, that should be adequate for some years to come.
Edit: As for your question about name length and number of names: if you really want people to write their real and complete names, I guess the only foolproof answer to both of those questions would be "infinite". I can't whip out any real examples for human beings, but surely there are human names analogous to the native name of the city of Bangkok.
I don't think there's a definitive answer. After all, some people have names that can't even be expressed in UTF-16...
There are some odd people out there, who will give their kids the craziest of names, including putting in weird punctuation, accents that don't exist in their own language, etc.
However, you can place arbitrary restrictions on your database. If you want to you can insist on 7 bit ASCII names. It's slightly rude to users, but they'll live with it. It certainly makes searching easier.
My colleague's daughter is named Amélie. But even some (not all!) official British government web sites ("Please enter the name exactly as shown on the birth certificate") won't accept the Unicode character, so he has to use 'Amelie' instead.
Any character that can be represented by any multiple of eight bits (greater than zero) is a possible character for a person's name. Lengths of both names and encodings are arbitrary, so no upper bound should be considered.
Just make sure you sanitize your database inputs so little Bobby Drop-tables doesn't get ya.
On the issue of name fields, the WRONG answer is first name, middle initial, last name, etc. for many reasons.
Many people are known by their middle name, and formally use a first initial, middle name, last name format.
In some cultures, the surname is the first name, and the given name is the last name.
Multiple first and/or middle given names are getting more common. As @Dour High Arch points out, the other extreme is people with only one word in their name.
In an object-oriented database, you would store a Name object with methods to return a directory-style or signature-style name; and the backing store would contain whatever data was necessary to support those methods.
I haven't yet seen a relational database model that improves on the model of two variable-length strings for directory-style and signature-style names.
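A bare-bones sketch of that model. The class and method names are invented, and the example strings reflect one assumed reading of "directory-style" (sort order) and "signature-style" (as the person writes it).

```javascript
// Two free-form strings per person; no attempt to split them into parts.
class PersonName {
  constructor(directoryStyle, signatureStyle) {
    this.directoryStyle = directoryStyle; // e.g. how it is filed and sorted
    this.signatureStyle = signatureStyle; // e.g. how the person writes it
  }

  toDirectoryStyle() {
    return this.directoryStyle;
  }

  toSignatureStyle() {
    return this.signatureStyle;
  }
}

const composer = new PersonName('Vaughan Williams, Ralph', 'Ralph Vaughan Williams');
console.log(composer.toDirectoryStyle());
console.log(composer.toSignatureStyle());
```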
I'm making software for driving schools in the USA, so to me what matters most is what the state DMVs accept as a proper name on a driver's license. In my case, it would cause problems to allow names beyond what the DMV allows, even if such names were legal, because the same name must later be used for a driver's license.
From StackOverflow, I still hadn't confirmed the answer I needed. And I happen to know that in my state (Calif) they're using AS400's with software probably written in COBOL, and to the best of my knowledge, those only support an 8-bit character set. (Is it EBCDIC?) Anyway... Ugh.
So, I called the California DMV... Sure enough, their system allows A-Z and spaces and absolutely nothing else. Not even hyphens are allowed -- Hyphens are replaced with spaces. In fact, apparently just to be difficult, they only use capitals. And names such as "O'Malley" must be replaced with OMALLEY.
Leave it to government. I must say I'm thrilled not to be a developer working for DMV. (Although I could really use that kind of salary.)
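Those rules are simple enough to sketch. The function name is made up, and only what was described above is implemented: capitals only, hyphens become spaces, everything other than A-Z and space is dropped. How accented letters should be handled was not covered, so this sketch simply drops them, which is an assumption.

```javascript
function toDmvName(name) {
  return name
    .toUpperCase()
    .replace(/-/g, ' ')        // hyphens are replaced with spaces
    .replace(/[^A-Z ]/g, '')   // nothing but A-Z and spaces survives
    .replace(/ +/g, ' ')       // collapse any doubled spaces
    .trim();
}

console.log(toDmvName("O'Malley"));
console.log(toDmvName('Smith-Jones'));
```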
It really depends on what the app is supposed to be used for.
Sure, in theory it's great if you allow every script on god's green earth to be used, but if the DB is also used by support staff, are they going to be able to handle names in Japanese, Hebrew and Thai script? Can your printer, if it's used to print postage labels?
You might add an extra field "Latin Transcription", but IMO it's really OK to restrict it to ISO-8859-1 characters - People who don't use Latin characters are by now so used to having to use a transcription that they don't mind it anymore, unless they're hardcore nationalists.
UTF-8 should be good enough. As far as name fields go, you'll want at minimum a first name and a last name.
Depending on the complexity of your name structure I could see:
First Name
Middle Initial/Middle Name
Last Name
Suffix (Jr. Sr. II, III, IV, etc.)
Prefix (Mr., Mrs., Ms., etc.)
What do you do when you have "The Artist Formerly Known as Prince"? That symbol he used is not a character in the Unicode set (AFAIK).
It's some levity, but at the same time, names are a rather broad concept that doesn't lend itself well to a structured format. In this case, something free-form might be most appropriate.

Resources