Where can I find a guide to different postal address formats that are used in the major countries in the world?
For example, in the U.S. one format is:
street_number street_name street_type
city, state zipcode
But in Germany it might be:
street_name street_number
postcode city
I have not found any definitive resource (there may not be one), but I have found several, which do not always agree. The first two seem helpful, though they focus on addressing mail from the US to other countries. The third, a book on MSDN, varied significantly from the others and may be out of date.
Address Doctor Address Formats
Columbia U. FRANK'S COMPULSIVE GUIDE TO POSTAL ADDRESSES
Developing International Software Appendix V International Address Formats
Lots of information here for many countries (the first hit on Google for "international postal address format").
Have a look at this Wikipedia page: http://en.wikipedia.org/wiki/Address_(geography)
There does seem to be a definitive resource here: Universal Postal Union Postal Addressing Systems in Member Countries.
Each country's postal service submits its addressing format to the Union Postale Universelle (Universal Postal Union). The Union has this information available for download and it is the only authoritative resource on the subject matter. The postal services around the world use it.
This material is available here: English; French.
If the links become unavailable, try searching on the Union's website for "addressing".
An extensive search has turned up no answer to the question: is there a class or function that checks input for soundness as a UK Ordnance Survey Grid Reference?
The UK is mapped by the Ordnance Survey, which produces detailed maps of the United Kingdom with many types of referencing. One of these is the commonly used six-figure Grid Reference; for example, we are at SO896804.
We already use a postcode (zip) checker to make sure that the information entered into the postcode field is sound, but we can't find the same for the OS Grid Reference.
Does such a Grid Reference function exist, or do we down tools and write one?
Thank you.
Since you tagged this "parsing", rather than using some GIS-like tag, I'd say that a reasonably valid OS grid reference corresponds to the regular expression:
(H[PTUWXYZ]|N[ABCDFGHJKLMNORQSTUWXYZ]|OV|S[CDEHJKMNOPRSTUVWXYZ]|T[AFGLMQRV])([0-9]{6})
If you were prepared to accept 4-digit 100ha blocks as well as the six-digit 1ha blocks, you could replace the second parenthesized expression with ([0-9]{4}|[0-9]{2}).
Of course, some of the two-letter blocks are almost completely marine. You could almost certainly ignore OV, for example. NW contains a little bit of Dumfries & Galloway (Portpatrick, NW995545), and a much larger but irrelevant part of Northern Ireland.
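If it helps, here is a minimal JavaScript sketch of that check (the function name, the anchoring, and the whitespace stripping are my own additions; it accepts both the six-digit form and the optional four-digit form):
// Land-covering 100 km squares from the regex above, anchored and
// extended to accept either six-digit (1 ha) or four-digit (100 ha) references.
var GRID_REF = new RegExp(
    '^(H[PTUWXYZ]|N[ABCDFGHJKLMNORQSTUWXYZ]|OV|' +
    'S[CDEHJKMNOPRSTUVWXYZ]|T[AFGLMQRV])' +
    '([0-9]{6}|[0-9]{4})$'
);

function isValidGridReference(input) {
    // Strip internal whitespace such as "SO 896 804" before matching.
    return GRID_REF.test(String(input).replace(/\s+/g, '').toUpperCase());
}

console.log(isValidGridReference('SO896804'));   // true
console.log(isValidGridReference('SO 896 804')); // true
console.log(isValidGridReference('ZZ123456'));   // false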
I am looking for some guidance for how I could check the pasteboard in iOS for a valid mailing address.
If someone pastes
1234 Apple Street
New York, NY 10011
it should parse each part of the string to fill in Address, City, State, and Zip. It could be any address, and it would be ideal if the address could be found inside a longer string.
For example
Meet me at 1234 Apple Street New York, NY 10011 See you there!
it should still parse out the correct Address, City, State, and Zip.
Any help would be much appreciated!
-Wes
I was a developer at SmartyStreets. We were kind of crazy about street addresses, and street addresses drove me crazy (especially parsing them). It's a two-way street. (Am I done with the street puns?)
First, let's talk about the case where the address is all by itself, because that's easier, albeit still difficult...
Please reference this other question and answer about the very same thing. I also strongly encourage you follow the links to related questions in both the question and the answer. Parsing addresses is a can of worms, but it's not impossible. It's just really hard to do it reliably.
Notice in the answer to that question how many different formats valid addresses can appear in. What guarantees do you have that the user will type it in any of those? And that's only a few. There are others. Consider military, PO box, rural route, and other "special" addresses that don't adhere to the typical format. What about addresses that have a two-or-three-word city name? What about addresses that use a grid system like 100 N 500 E, or secondary numbers like suite, apartment, floor, etc? What about addresses with "1/2", hyphens (as a required punctuation), etc? Addresses missing zip codes or city/state?
All of these and more could be valid. And that's only for US addresses.
If all your addresses, or even most of them (which isn't the case), came in a form like the one you proposed above, for example:
[Primary Number] [Street Name] [Any of these street suffixes]
[City Name Followed by a Comma], [State Abbreviation] [5-digit ZIP code]
Then this would be quite easy. Wouldn't that be nice?
You could try to write a regular expression like this guy or that guy, but that only works if addresses are a regular language. They're not regular, and regular expressions are not the answer.
There are a few services which can do this for you because they have a master list (kind of), and the software has to meet rigorous certification standards.
Obviously, since I work at SmartyStreets, I'm prone to suggest starting your search for an answer there. You can try some freeform addresses on the homepage (just fill out the "Street" field). But be aware of a few things that will probably always be an issue. LiveAddress API will be able to parse street addresses for you, most of the time. Shop around, but this should give you an idea.
Now your second question: extract a street address from a string of text. This has been extensively covered elsewhere on S.O. and the interwebs, so I won't go into a lot of detail. Basically, to do this reliably, you'll probably need some Natural Language Processing and human interaction to confirm or correct the best guess.
Don't ever assume these things about un-standardized addresses:
Starts with a number
Ends with a number
Everything between the two numbers is an address
Has a ZIP code present
No more than 2 numbers will be in an address
It's unambiguous
It exists
A street suffix will always be present
It's spelled correctly
...etc.
Again, refer to some other linked posts about this issue. You can make guesses, but always always always have a human confirm the guess if you do that. (Some Mac apps do this. If they detect an address, it will get highlighted, and you can add that address to your contacts. Unfortunately I've seen false positives a lot, and it also misses them a lot.)
Good luck!
I also work at SmartyStreets, and since I'm not a developer I'm not bound by constraints such as "it can't be done" or "there's no way to do it reliably". In fact, the ideas I come up with may not even always be possible, but I'm a problem-solver, a solution-finder, and this particular problem absolutely has a solution.
You'll need the following: a little regex, knowledge of a scripting language (python, php, whatever you prefer) and access to an address validation tool (this is required so that you know when you get it right).
So, let's start with the example sentence:
Meet me at 1234 Apple Street New York, NY 10011 See you there!
We can be sure that every address has a beginning and an end. (you can take that to the bank!)
So, if you run a regular expression that looks for the beginning of the address within the string you can eliminate everything before the address begins. Here's a regex that will do just that:
(^(.*(?=p\.?o\.? box|h\.?c\.?r\.? |c\.?m\.?r\.?)|^[^0-9]+))
This will give you back the following:
1234 Apple Street New York, NY 10011 See you there!
Now, you're halfway there, but you'll need to loop through the remaining string. Another assumption you can certainly make is that an address will never be longer than 328 characters. (I made up that number, but you get the picture: an address has to have an end as well, and you can shorten the string by determining the maximum acceptable USPS address length.)
You're going to loop through the address string until you get a valid address out of it. To do this, start at the beginning and move one word to the right with each additional permutation. This is where the address validation service comes in handy, because you have no idea where the address ends, and that's what you need to know. So, each permutation you generate from the string (remember, you're starting from the left side) will be sent for validation. Since no valid address can have fewer than two words, you'll start there. Here are the permutations from the example address as well as the validation results (I'm trying each address by entering it in the address line of the address search box on smartystreets.com):
1234 Apple ==> fail
1234 Apple Street ==> fail
1234 Apple Street New ==> fail
1234 Apple Street New York ==> fail
1234 Apple Street New York, NY ==> Bingo, valid address match. No need to keep iterating.
Now, obviously this is not a valid address but you can try the same thing with a real address and you'll get the same results. Obviously this isn't the most sophisticated method to extract a valid address from a string but it certainly works. And, since SmartyStreets allows you to send up to 100 addresses per query, you could permute the address string up to 99 times and get the results back in under 300ms. This won't work with every address, as you'll certainly find out, but it can very easily handle a large majority of them, regardless of how obscured the address is within the text string.
So, we started with "Meet me at 1234 Apple Street New York, NY 10011 See you there!" and in less than half a second came up with "1234 Apple Street New York, NY 10011-1000".
Pretty cool huh? It even sounds really easy coming from a non-programmer.
Let's try it with a real address:
Meet me at 4219 jon young orlando fl 32839 See you there!
Apply regex and you get:
4219 jon young orlando fl 32839 See you there!
Permute, iterate, validate:
4219 jon ==> fail
4219 jon young ==> fail
4219 jon young orlando ==> fail
4219 jon young orlando fl ==> Bingo, valid address match.
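For what it's worth, a rough JavaScript sketch of the whole approach might look like the following. validateAddress here is a hypothetical stand-in for whatever verification call you use; its name, signature, and Promise-based shape are my own placeholders, not a real SmartyStreets client.
// Regex from above: strip everything before the address appears to start.
var LEAD_IN = /^(.*(?=p\.?o\.? box|h\.?c\.?r\.? |c\.?m\.?r\.?)|^[^0-9]+)/i;

// Hypothetical stand-in for an address validation call; assume it
// resolves to true when the candidate is a deliverable address.
function validateAddress(candidate) {
    return Promise.resolve(false); // replace with a real API call
}

async function extractAddress(text) {
    var remainder = text.replace(LEAD_IN, '');
    var words = remainder.split(/\s+/).filter(Boolean);

    // No valid address has fewer than two words, so start there and
    // grow the candidate one word at a time until something validates.
    for (var end = 2; end <= words.length; end++) {
        var candidate = words.slice(0, end).join(' ');
        if (await validateAddress(candidate)) {
            return candidate; // first hit wins; no need to keep iterating
        }
    }
    return null; // nothing in the string validated
}

extractAddress('Meet me at 4219 jon young orlando fl 32839 See you there!')
    .then(function (address) { console.log(address); });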
Can anyone point me to a list of international postal / residential / delivery address format templates that use some kind of parseable standard vocabulary for address parts?
The ideal list contains a country code then a format using replaceable tokens so I can substitute database address fields into a template to produce something printable in the local format.
for example
NZ | [first_name] [family_name]\n[company_name]\n[street_address]\n[city] [post_code]\n[country]
AU | [first_name] [family_name]\n[company_name]\n[street_address]\n[city]\n[state] [post_code]\n[country]
US | etc
UK | etc
Background: I used to have a simple freetext field to accept addresses. I'm moving to support vCard download, which requires addresses to be broken down into specific fields. That's all fine: we can do the migration. I'm looking for a way to display the fields in the "correct" order for each country. Thanks for your help!
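For illustration, here is a rough JavaScript sketch of how I'd consume such a table, using the bracketed token format from the example above (the two entries shown are just placeholders; the per-country table itself is what I'm looking for):
// Per-country templates keyed by country code; only two illustrative entries.
var TEMPLATES = {
    NZ: '[first_name] [family_name]\n[company_name]\n[street_address]\n[city] [post_code]\n[country]',
    AU: '[first_name] [family_name]\n[company_name]\n[street_address]\n[city]\n[state] [post_code]\n[country]'
};

function formatAddress(countryCode, fields) {
    var template = TEMPLATES[countryCode];
    return template
        .replace(/\[([a-z_]+)\]/g, function (match, token) {
            return fields[token] || '';
        })
        // Drop lines that ended up empty (e.g. no company name).
        .split('\n')
        .filter(function (line) { return line.trim() !== ''; })
        .join('\n');
}

console.log(formatAddress('AU', {
    first_name: 'Jane', family_name: 'Doe', street_address: '1 Example St',
    city: 'Sydney', state: 'NSW', post_code: '2000', country: 'Australia'
}));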
This MSDN page has the information in the format you need and seems accurate, but covers only 33 countries. Maybe they are enough.
The Universal Postal Union offers all the information you need for a lot of countries here. This is top quality information; however, it is split across as many PDF documents as there are countries and is not in the format you need.
This page provides the information in a slightly more accessible form. It contains a lot of valuable info and looks accurate as far as I can judge, but I can't vouch for its quality or how current it is.
Google has a JSON-based API, used by its Android address input field library, that contains this kind of formatting information.
The field you'd be interested in is fmt. There doesn't seem to be any formal documentation on the format they use, but a proposal to include this information as part of the Unicode CLDR has matching fields (scroll down to "Detailed Breakdown of elements"); there are also some clues in Google's libaddressinput source code.
We need to develop a back-end application that can parse a full name into
Prefix (Dr. Mr. Ms. etc)
First Name
Last Name
Middle Name
etc
The challenge here is that it has to support names from multiple countries and languages. One assumption we can make is that we will always get a country and language along with the full name as input.
The full name may come in any format. For the same country/language combination, it may come as first name then last name, or the reverse. A comma will not be part of the full name.
Is this feasible? We are also open to any commercially available software.
I think this is impossible. Consider Ralph Vaughan Williams. His family name is "Vaughan Williams" and his first name is "Ralph". Contrast this with Charles Villiers Stanford, whose family name is "Stanford", with first name "Charles" and middle name "Villiers".
Both are English-speaking composers from England, so country and language information is not sufficient to establish the correct parsing logic.
Since the OP was open to any commercially available offering...
The "IBM InfoSphere Global Name Analytics" appears to be a commercial solution satisfying the original request for the parsing of a [free-form unstructured] personal name [full name]; apparently with a degree of certainty in regards to resolving some of the name ambiguity issues alluded to in other responses.Note: I have no personal experience nor association with the product, I had merely encountered this discussion and the following reference links while re-investigating effectively the same concern as described by the OP. HTH.
A general product documentation link:
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gna_con_gnaoverview.html
Refer to the "Parsing names using NameParser" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_np_con_parsingnamesusingnameparser.html
The NameParser is a component API for the product per
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturecapis.html
Refer to the "Parsing names using IBM NameWorks" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_parsingnamesusingnameworks.html
"IBM NameWorks combines the individual IBM InfoSphere Global Name Recognition components into a single, unified, easy-to-use application programming interface (API), and also extends this functionality to Java applications and as a Web service"
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturenwapis.html
To clarify why I think this answers the question and addresses some of the previously mentioned difficulties: if I understood what I read correctly, the APIs use the "NameHunter Server" to search the "IBM InfoSphere Global Name Data Archive (NDA)", which is described as "a collection of nearly one billion names from around the world, along with gender and country of association for each name. This large repository of name information powers the algorithms and rules that IBM InfoSphere Global Name Recognition products use to categorize, classify, parse, genderize, and match names."
FWIW, I also ran across a "Name Parser" which uses a database of ~140K names, as noted at:
http://www.melissadata.com/dqt/websmart-web-services.htm
The only reasonable approach is to avoid having to do so in the first place. The most obvious (and common) way to do that is to have the user enter the title, first/given name, last/family name, suffix, etc., separately from each other, rather than attempting to parse them out of a single string.
Ask yourself: do you really need the different parts of a name? Parsing names is inherently un-doable, since different cultures use different conventions (e.g. "middle name" is a typical USA-ism) and some small percentage of names will always be treated wrongly.
It is much preferable to treat a name as an "atomic" not-splittable entity.
Here are two free PHP name parsing libraries for those on a budget:
https://code.google.com/p/php-name-parser/
http://jasonpriem.org/human-name-parse/
And here is a JavaScript library on the Node package manager (npm):
https://npmjs.org/package/name-parser
I wrote a simple human name parser in JavaScript as an npm module:
https://www.npmjs.org/package/humanparser
humanparser
Parse a human name string into salutation, first name, middle name, last name, suffix.
Install
npm install humanparser
Usage
var human = require('humanparser');
var fullName = 'Mr. William R. Jenkins, III'
, attrs = human.parseName(fullName);
console.log(attrs);
//produces the following output
{ salutation: 'Mr.',
firstName: 'William',
suffix: 'III',
lastName: 'Jenkins',
middleName: 'R.',
fullName: 'Mr. William R. Jenkins, III' }
A basic algorithm could do the following:
First, see if the incoming string starts with a title such as Mrs, checking against a fixed list of titles, and remove it if it does.
If exactly one space remains (i.e. two words), assume the first word is the first name and the second word is the surname (which will sometimes be incorrect).
Going beyond that would be a lot of work; see How to parse full names to identify avenues for improvement, and see these involved IBM docs for further implementation clues.
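A minimal JavaScript sketch of those two steps (the title list and field names are just illustrative):
// A deliberately naive parser following the two steps above.
var TITLES = ['Mr', 'Mrs', 'Ms', 'Miss', 'Dr', 'Prof']; // fixed list; extend as needed

function parseBasicName(input) {
    var words = input.trim().split(/\s+/);
    var result = { title: null, firstName: null, lastName: null };

    // Step 1: peel off a leading title, ignoring case and a trailing period.
    var first = words.length ? words[0].replace(/\.$/, '') : '';
    if (TITLES.some(function (t) { return t.toLowerCase() === first.toLowerCase(); })) {
        result.title = words.shift();
    }

    // Step 2: if exactly two words remain, call them first name and surname.
    // (As noted above, this will sometimes be wrong.)
    if (words.length === 2) {
        result.firstName = words[0];
        result.lastName = words[1];
    }
    return result;
}

console.log(parseBasicName('Mrs Jane Smith'));
// { title: 'Mrs', firstName: 'Jane', lastName: 'Smith' }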
"Ashton Jordan" "Jordan Ashton" -- u can't tell which is the surname and which is the give name.
Also people in South India apparently don't have a surname. The same with Sherpas in the Himalayas.
But say you have a huge list of all surnames (which are never used as given names) then maybe you can use that to identify other parts of the name (Salutations/Given/Middle/Jr/Sr/I/II/...) And if there is ambiguity your name-parser could ask for human input.
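A rough JavaScript sketch of that idea, assuming you have such a surname list (the tiny set here is only illustrative):
// A real surname list would be huge; this tiny set is only illustrative.
var KNOWN_SURNAMES = new Set(['smith', 'jenkins', 'stanford']);

// Returns { givenName, surname } when exactly one word is a known surname,
// or null when the split is ambiguous and a human should decide.
function splitTwoWordName(name) {
    var parts = name.trim().split(/\s+/);
    if (parts.length !== 2) return null; // longer names need more work

    var firstIsSurname = KNOWN_SURNAMES.has(parts[0].toLowerCase());
    var secondIsSurname = KNOWN_SURNAMES.has(parts[1].toLowerCase());

    if (firstIsSurname && !secondIsSurname) {
        return { givenName: parts[1], surname: parts[0] };
    }
    if (secondIsSurname && !firstIsSurname) {
        return { givenName: parts[0], surname: parts[1] };
    }
    return null; // neither (or both) matched: ambiguous, ask a human
}

console.log(splitTwoWordName('Jane Smith'));    // { givenName: 'Jane', surname: 'Smith' }
console.log(splitTwoWordName('Ashton Jordan')); // null -> ambiguous, ask a human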
As others have explained, the problem is not solvable. The best approach I can think of for storing names is to store the full name together with the start (and potentially also end) offsets of a "primary collating subfield", which the person entering the name could have indicated by highlighting it. For example
John Robert Miller, Jr.
where "Miller" is the part marked as the "primary collating subfield". This range would then be moved to the beginning of the string when generating the collating key.
Of course this approach alone may not be sufficient if you also want to support titles (and ignoring them for collation purposes)...
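A minimal JavaScript sketch of that storage idea (the field names are made up for illustration):
// Store the display form plus the [start, end) offsets of the part the
// user highlighted as the "primary collating subfield".
var person = {
    fullName: 'John Robert Miller, Jr.',
    collateStart: 12, // offset of "Miller"
    collateEnd: 18
};

// Build a collation key by moving the marked range to the front.
function collationKey(p) {
    var primary = p.fullName.slice(p.collateStart, p.collateEnd);
    var rest = p.fullName.slice(0, p.collateStart) + p.fullName.slice(p.collateEnd);
    return (primary + ' ' + rest).replace(/\s+/g, ' ').trim();
}

console.log(collationKey(person)); // "Miller John Robert , Jr."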
What products support 3-digit region subtags, e.g., es-419 for Latin-American Spanish?
Are web browsers, translation tools and translators familiar with these numeric codes in addition to the more common "es" or "es-ES"?
I've already visited the following pages:
W3C Choosing a Language Tag
W3C Language tags in HTML and XML
RFC 5646 Tags for Identifying Languages
Microsoft National Language Support (NLS) API Reference
I doubt that many such products exist. It seems that some mainstream programming languages (I have tested C# and Java) do not support these tags, so it would be quite hard to develop programs that do.
BTW, the NLS API Reference that you provided does not contain a region tag in any of the LCID definitions. And if you think about it for a moment, knowing how a Locale Identifier is built, there is actually no way to support them now. An implementation change would be required (they would have to use some reserved bits, I suppose).
I don't think we will see support for region tags in the foreseeable future.
Edit
I saw that Microsoft assigned LCIDs of -1 and -2 to "European Union 1" and "European Union 2", respectively. However, I don't think it is related.