Detect/Parse Mailing Addresses in Text - parsing

Are there any open source or commercial libraries that can detect mailing addresses in text, the way Apple's Mail app underlines addresses on the Mac/iPhone?
I've been doing a little online research, and the options seem to be Google, regex, or a full-on NLP package such as Stanford's NLP, which are usually pretty massive. I doubt the iPhone has a 500MB NLP package in there, or connects to Google every time you read an email, which makes me believe there should be an easier way. Too bad UIDataDetectors is not open source.
I know this question has been asked before, but there were no conclusive answers, so here's my try.

For Python, you can try pyap:
https://pypi.python.org/pypi/pyap
It currently supports US and Canadian addresses.

Parsing addresses isn't a science. At my office we have been dealing with address parsing for years, and the problem is that there aren't any rules about what constitutes a valid address. We use the USPS address database for cleaning addresses, which is actually pretty fast and far more accurate than anything we were able to build on our own. It gets us 98% accuracy, whereas before we got about 90% cleaned addresses.
The bigger problem with address parsing tends to be that people don't input the address the same way. The same address might appear in any of the following forms:
128 E Beaumont St
128 East Beaumont Street
128 E Bmt St
128 Beaumont Street
128 Highway 88
The third one looks totally wrong, but people will type that sometimes. Sometimes a street is also a highway. There are a bunch of possibilities. Just try to catch 90% and accept that this is as good as it gets for address parsing.
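To illustrate why the first two forms are the easy case, here is a minimal sketch of token-level normalization. The function name and the tiny abbreviation table are my own assumptions for illustration; a real table (for example the suffix list in USPS Publication 28) is much larger, and nothing here handles the "Bmt" or missing-directional cases.

```python
import re

# Illustrative subset only (an assumption); real abbreviation tables are far larger.
ABBREVIATIONS = {
    "east": "E", "west": "W", "north": "N", "south": "S",
    "street": "ST", "avenue": "AVE", "road": "RD",
    "highway": "HWY", "boulevard": "BLVD",
}

def normalize_street(line: str) -> str:
    """Upper-case the tokens and collapse known long forms to their abbreviations."""
    # Keep only alphanumeric runs, so "St." and "St" produce the same token.
    tokens = re.findall(r"[A-Za-z0-9]+", line)
    return " ".join(ABBREVIATIONS.get(t.lower(), t.upper()) for t in tokens)

print(normalize_street("128 East Beaumont Street"))
print(normalize_street("128 E Beaumont St"))
```

Both calls yield the same string, so the two spellings can be compared directly. "128 Beaumont Street" still comes out different, which is the part that needs a reference database rather than string rules.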

Extractiv provides commercial NLP powered by Language Computer Corporation that can parse entities and relations in either uploaded documents or from web crawls. The former service utilizes a REST API. I dropped this URL in, and it extracts 4/5 of the addresses. Note that having them strung together like that makes them especially difficult.
Search for "address" in this JSON output:
http://rest.extractiv.com/extractiv/?url=https://stackoverflow.com/questions/5099684/detect-parse-mailing-addresses-in-text&output_format=json
One of them:
{
"id": 11,
"len": 17,
"offset": 1557,
"text": "128 E Beaumont St",
"type": "ADDRESS"
},
(Note: if you use the HTML output, which is more for demos, it filters out non-sentence content, which is why I showed the JSON instead).
Disclaimer: I work at Extractiv.
Update:
Extractiv is no more.

You can actually get extremely high accuracy, as Drew mentioned, by extracting the addresses and then comparing them against the USPS data. Getting a DVD from the USPS yearly will certainly work but doesn't factor in the addresses that change. For that, you would want a more up-to-date version. The USPS publishes its updated address data (in a proprietary format) monthly, so that would be a good source of authoritative addresses.
On top of that, using an address validation service (after you extract the address data) will standardize the addresses for you and then check them for deliverability and/or vacancy status. As Drew mentioned, the same address can be written in many different ways that still work. However, the USPS will always use the standardized format.
In order to do what you are looking for programmatically, you'll definitely want an API, although list processing services are also available.
SmartyStreets has a free address validation API called LiveAddress that will standardize, verify, and then validate any US postal address. In the interest of full disclosure, I'm the founder of SmartyStreets.

Related

Maestro Credit Card: Pulling information from MSR dump (Any language)

We have a system that allows you to scan your credit card on an MSR, and from the dump I pull the needed fields such as name/cc/exp. Recently we had to add globalized credit cards to this. For almost all of the cards provided, I was still able to pull the information since they all seemed to follow a standard. One exception, however, was a Maestro card. The format is completely different, and since I neither have one to verify the actual number on the card against the dumped data, nor have access to any other dumps, it's very hard for me to figure out the correct format. I also did some Google searching, with little luck finding anything about extracting data from an MSR dump.
Unlike almost all other cards, track one does not start with "%B" and Track two does not start with ";". Both tracks do appear to end with "?" (based off analyzing the whole dump, not by track). Track 3 does appear to be empty, which is normal.
The whole dump seems to lack any name data and is basically in the format of:
###=###?
###=###=###==#=###?
Note that, apart from the single #, each ### stands for a variable-length field.
Again, I only had access to one single dump, which for obvious reasons I cannot post here.
If anyone has some example code in any language, or can link me to some help, I'd really appreciate it.
Thanks in advance,
Anthony
Is it possible that the card you are testing is faulty, or simply a non-standard card that is generally not supported? Try to check track data from other Maestro cards before assuming your system is at fault.
I say this because ISO 7813, the governing standard for transaction cards, is pretty clear that track 2 data begins with the start sentinel ";" and that all valid bank cards have the format code "B" following the start sentinel "%" in track 1.
Check the standard carefully and make sure your system is parsing correctly:
http://www.gae.ucm.es/~padilla/extrawork/tracks.html
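For reference, a rough sketch of what parsing the standard layouts could look like. The regexes follow the commonly documented ISO 7813 field order (PAN, name, YYMM expiry, 3-digit service code, discretionary data); the function name and the sample data are hypothetical, and edge cases such as an omitted expiry field are deliberately not handled.

```python
import re
from typing import Optional

# Track 1, format code B: %B<PAN>^<NAME>^<YYMM><service code><discretionary>?
TRACK1 = re.compile(
    r"^%B(?P<pan>\d{1,19})\^(?P<name>[^^]{2,26})\^"
    r"(?P<exp>\d{4})(?P<service>\d{3})(?P<disc>[^?]*)\?"
)
# Track 2: ;<PAN>=<YYMM><service code><discretionary>?
TRACK2 = re.compile(
    r"^;(?P<pan>\d{1,19})=(?P<exp>\d{4})(?P<service>\d{3})(?P<disc>[^?]*)\?"
)

def parse_track(raw: str) -> Optional[dict]:
    """Return the fields of a standard track 1 or track 2 string, or None."""
    text = raw.strip()
    for number, pattern in ((1, TRACK1), (2, TRACK2)):
        match = pattern.match(text)
        if match:
            fields = match.groupdict()
            fields["track"] = number
            return fields
    # Missing start sentinel or unexpected layout, as in the dump described above.
    return None

# Made-up test data, not a real card.
print(parse_track("%B4111111111111111^DOE/JOHN^25121010000000000?"))
print(parse_track(";4111111111111111=25121010000000000?"))
print(parse_track("123=456?"))
```

A dump without the "%B" or ";" sentinels returns None here, which at least lets the caller flag the card as non-standard instead of mis-parsing it.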

Geolocation, Is it possible to get latitude and longitude from address and store locally in my database

I want to be able to run queries locally comparing latitude and longitude of locations so I can run queries for certain addresses I've captured based on distance.
I found a free database that has this information for zip codes, but I want this information for more specific addresses. I've looked at Google's geolocation service, and it appears to be against the TOS to store these values in my database or to use them for anything other than Google Maps. (If somebody's looked deeper into this and I'm incorrect, let me know.)
Am I likely to find any (free or pay) service that will let me store these lat/lon values locally? The number of addresses I need is currently pretty small but if my site becomes popular it could expand quite a bit over time to a large number. I just need to get the coordinates of each address entered once though.
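Once the coordinates are stored, the distance comparison itself can be done locally. A minimal sketch using the haversine formula follows; the function names and the (name, lat, lon) row shape are assumptions for illustration, and the Earth is treated as a sphere of radius 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_radius(origin, rows, radius_km):
    """Return (distance_km, name) pairs within radius_km of origin, nearest first.

    rows is an iterable of (name, lat, lon) tuples, e.g. fetched from a local table.
    """
    lat0, lon0 = origin
    hits = [(haversine_km(lat0, lon0, lat, lon), name) for name, lat, lon in rows]
    return sorted(hit for hit in hits if hit[0] <= radius_km)

# Hypothetical stored rows.
rows = [("a", 0.0, 0.5), ("b", 0.0, 2.0)]
print(within_radius((0.0, 0.0), rows, 100))
```

For a small table a full scan like this is fine. With many rows, a bounding-box filter on indexed lat/lon columns before the exact calculation, or a spatial database, would be the usual next step.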
This question hasn't received enough attention...
You're correct -- it can't be done with Google's service and still conform to the TOS. Cheers to you for honestly seeking to comply with the TOS.
I work at a company called SmartyStreets where we process and verify addresses, and geocode them, too. Google's terms don't allow you to store the data returned from the API, and there are pretty strict usage limits before they throttle or cut off your access.
Screen scraping presents many challenges and problems which are both technical and ethical, and I don't suppose I'll get into them here. The Microsoft library linked to by Giorgio is for .NET only.
If you're still serious about doing this, we have a service called LiveAddress which is accessible from any platform or language. It's a RESTful API which can be called using GET or POST for example, and the output is JSON which is easy to parse in pretty much every common language/platform.
Our terms allow you to store the data you collect as long as you don't re-manufacture our product or build your own database in an attempt to duplicate ours (or something of the like). For what you've described, though, it shouldn't be a problem.
Let me know if you have further questions about address geocoding; I'll be happy to help.
By the way, there's some sample code at our GitHub repo: https://github.com/smartystreets/LiveAddressSamples
If you just need to get them once, you could use a screen scraper on http://www.zip-info.com/cgi-local/zipsrch.exe?ll=ll&zip=13206&Go=Go
Microsoft also provides this service. Check whether this can help you: http://msdn.microsoft.com/en-us/library/cc966913.aspx

Physical Address to GeoLocation UK

Is there a good physical-address-to-geolocation conversion database for the UK? I am trying to use this to build a Globrix-style search box (http://www.globrix.com/) for a web application. Any pointers would be nice. I have been searching for hours. I have found several that convert UK postcodes into geolocations, but I need the addresses listed as on Globrix.
The Google Maps API provides a geocoder webservice that you can actually use independently of Google Maps itself. You send it the address/postcode, and it responds with a lat/long plus disambiguated addresses. We use it server-side in the UK to do address lookup. It's incredibly quick, too.
http://code.google.com/apis/maps/documentation/geocoding/index.html
http://www.postcodeanywhere.co.uk should be able to help with this. Alternatively, you can buy the "PAF" (Postcode Address File) from the Royal Mail, but it is expensive.
An update on UK geolocation as of 2020. Since 2009:
Google's Geocoder became an order of magnitude more expensive in 2018. It's ~0.5c per search with no free tier.
The Office for National Statistics has released a free postcode directory called ONSPD. This means that if you have the postcode of your address, you can resolve a geolocation accurate to the postcode centroid (this may be 10-100m or so out). There's a free public service API available at https://postcodes.io which allows you to forward or reverse geocode a postcode. There are also public Docker data and application images which allow you to host this easily.
If you're interested in rooftop-accurate geocodes, a change in Ordnance Survey licensing in 2020 has meant it's much simpler and cheaper to access geolocations for almost all premises in Great Britain from Ordnance Survey by combining it with Royal Mail PAF (Postcode Address File). As of September 2020, I think https://ideal-postcodes.co.uk is the only company to offer complete and authoritative rooftop geolocations under these new rules. It's likely other PAF vendors will catch up over the coming years.
Disclaimer: I'm the author of postcodes.io and work for ideal-postcodes.co.uk
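A rough sketch of a postcode lookup against the public API, using only the standard library. The endpoint path and the "result" / "latitude" / "longitude" field names reflect my understanding of the postcodes.io documentation and should be treated as assumptions; the helper names are made up.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; check the postcodes.io docs before relying on it.
API_ROOT = "https://api.postcodes.io/postcodes/"

def lookup_url(postcode: str) -> str:
    """Build the lookup URL, percent-encoding the space inside the postcode."""
    return API_ROOT + urllib.parse.quote(postcode.strip())

def extract_coordinates(payload: dict):
    """Pull (latitude, longitude) out of a decoded response, or None if absent."""
    result = payload.get("result") or {}
    lat, lon = result.get("latitude"), result.get("longitude")
    if lat is None or lon is None:
        return None
    return (lat, lon)

def geocode_postcode(postcode: str):
    """Perform the HTTP request; urlopen raises HTTPError on 4xx/5xx responses."""
    with urllib.request.urlopen(lookup_url(postcode), timeout=10) as resp:
        return extract_coordinates(json.load(resp))

print(lookup_url("SW1A 2AA"))
```

The result is the postcode centroid, so it suits "near this postcode" searches rather than rooftop-level matching.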

Are there any better geolocation databases / technologies / services or has anyone done any work with improving the accuracy of existing systems?

I am working on integrating geolocation services into a website and the best source of data I've found so far is MaxMind's GeoIP API with GeoLite City data. Even this data seems to often be questionable though. For example, I am located in downtown Palo Alto, but it locates my IP as being in Portola Valley, which is about 7 miles away. Palo Alto has a population of 60k+, whereas Portola Valley has a population of less than 5k. I would think if you see an IP originating somewhere around there it would make more sense to assume it was coming from the highly populated city, not the tiny one. I've also had it locate Palo Alto IPs completely across the country in Kentucky, etc.
Does anyone know of any better sources of data, or any tools/technologies/efforts to improve the accuracy of geolocation efforts? Commercial solutions are fine.
Where an IP comes up at the wrong end of the country, you probably won't find a better match elsewhere because it's probably an ISP that uses one group of IPs for customers in a wide area. My favourite example is trains here in the UK where the on-board wifi is identified as being in Sweden because they use a satellite connection to an ISP in Sweden.
A commercial supplier may be able to afford to spend more time tracking down the hard cases, but in many cases there just won't be a good answer to give you. They may, however, give you a confidence factor to tell you when they're guessing. I've heard good things about Quova, though I've never used them.
Assuming that you've got the best latitude and longitude that you can get (or can afford), then you're left dealing with cases where they pick the closest city rather than a more likely larger city nearby. Unfortunately I don't have the code to hand, but I had some success using the data from geonames to pick a "sensible" city near a point. They list lat/long and population, so you can do something like
ORDER BY ( Distance / LOG( Population ) )
You'd need to experiment with that to get something with the right level of bias towards larger cities, but I had it working quite nicely taking the centre of a Google Maps view and displaying a heading like "Showing results near London..." that changed as you moved the map.
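The same ranking can be sketched outside the database. This assumes you already have a distance for each candidate city; the function name is hypothetical, and the distances and populations below are rough illustrative figures based on the question, not measured data.

```python
import math

def pick_sensible_city(candidates):
    """Pick the city with the lowest distance / log(population) score.

    candidates is an iterable of (name, distance_km, population) tuples.
    """
    def score(candidate):
        _, distance_km, population = candidate
        # Clamp tiny populations so the logarithm stays positive.
        return distance_km / math.log(max(population, 2))
    return min(candidates, key=score)[0]

# Illustrative numbers only.
candidates = [
    ("Portola Valley", 4.6, 4500),
    ("Palo Alto", 5.7, 64000),
]
print(pick_sensible_city(candidates))
```

With these numbers the larger city wins even though it is about a kilometre farther away. The base of the logarithm does not change the ordering, but the bias is fairly gentle, so a different weighting may be needed if small towns still win too often.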
I am not sure if this will help, but here is a site that has done a pretty good job of IP mapping. Maybe you could ask them for help :) seomoz.org
A couple of sites I saw referenced recently for free GeoIP services are
WIPmania
hostip.info

Finding City and Zip Code for a Location

Given a latitude and longitude, what is the easiest way to find the name of the city and the US zip code of that location?
(This is similar to https://stackoverflow.com/questions/23572/latitude-longitude-database, except I want to convert in the opposite direction.)
Related question: Get street address at lat/long pair
Any of the online services mentioned, and their competitors, offer "reverse geocoding", which does what you ask: it converts lon/lat coordinates into a street address of some sort.
If you only need the zip codes and/or cities, then I would obtain the Zip Code database and urban area database from the US Census Bureau, which is FREE (paid for by your tax dollars). http://www.census.gov/geo/www/cob/zt_metadata.html.
From there, you can either come up with your own search algorithm for the spatial data or make use of a spatial database such as Microsoft SQL Server, PostGIS, Oracle Spatial, ArcSDE, etc.
Update: The 2010 Census data can be found at:
http://www2.census.gov/census_2010/
This is the web service to call.
http://developer.yahoo.com/search/local/V2/localSearch.html
This site has ok web services, but not exactly what you're asking for here.
http://www.usps.com/webtools/
You have two main options:
Use a Reverse Geocoding Service
Google's can only be used in conjunction with an embedded Google Map on the same page, so I don't recommend it unless that is what you are doing.
Yahoo has a good one, see http://developer.yahoo.com/search/local/V3/localSearch.html
I've not used OpenStreetMap's. Their maps look very detailed and thorough, and are always getting better, but I'd be worried about latency and reliability, and whether their address data is complete (address data is not directly visible on a map, and OpenStreetMap is primarily an interactive map).
Use a Map of the ZIP Codes
The US Census publishes a map of US ZIP codes here. They build this from their smallest statistical unit, a Census Block, which corresponds to a city block in most cases. For each block, they find what ZIP code is most common on that block (most blocks have only one ZIP code, but blocks near the border between ZIP codes might have more than one). They then aggregate all the blocks with a given ZIP code into a single area called a Zip Code Tabulation Area. They publish a map of those areas in ESRI shapefile format.
I know about this because I wrote a Java Library and web service that (among other things) uses this map to return the ZIP code for a given latitude and longitude. It is a commercial product, so it won't be for everyone, but it is fast, easy to use, and solves this specific problem without an API. You can read about this product here:
http://askgeo.com/database/UsZcta2010
And about all of our geographic offerings here:
http://askgeo.com
Unlike reverse geocoding solutions, which are only available as Web APIs because running your own service would be extremely difficult, you can run this library on your own server and not depend on an external resource.
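To make the map-based approach concrete, here is a minimal sketch of the core lookup: a ray-casting point-in-polygon test over ZCTA outlines. It is not the library described above; the function names are made up, the polygons are toy squares with fake codes, and loading the real shapefile (which needs a shapefile reader) as well as holes and multi-part areas are left out.

```python
def point_in_polygon(lon, lat, ring):
    """Ray casting: True if (lon, lat) lies inside the ring of (lon, lat) vertices."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge from vertex j to vertex i straddle the point's latitude?
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside

def zip_for_point(lon, lat, zcta_rings):
    """Return the first ZCTA code whose outline contains the point, else None."""
    for zcta, ring in zcta_rings.items():
        if point_in_polygon(lon, lat, ring):
            return zcta
    return None

# Toy data: two adjacent unit squares with fake codes.
zcta_rings = {
    "00001": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "00002": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
print(zip_for_point(1.5, 0.5, zcta_rings))
```

A linear scan over every area is slow at national scale, so a real implementation would first narrow the candidates with a spatial index or bounding boxes.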
If your call volume to the service gets too high, you should definitely consider getting your own set of postal data. In most cases, that will provide all of the information that you need, and there are plenty of database tools for indexing location data (e.g. PostGIS for PostgreSQL).
You can buy a fairly inexpensive subscription to zipcodes with lat and long info here: http://www.zipcodedownload.com/
Or use Google's reverse geocoding:
http://maps.google.com/maps/geo?output=xml&q={0},{1}&key={2}&sensor=true&oe=utf8
where {0} is the latitude, {1} is the longitude, and {2} is your API key.
geonames has an extensive set of web services that can handle this (among others):
http://www.geonames.org/export/web-services.html#findNearbyPostalCodes
http://www.geonames.org/export/web-services.html#findNearbyPlaceName
Another reverse geocoding provider that hasn't been listed here yet is OpenStreetMap: you can use their Nominatim search service.
OSM has the (potentially?) added bonus of being entirely user editable (wiki-like) and thus having a very liberal licencing scheme for all this data. Think of it as open source map data.
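A sketch of how a Nominatim reverse lookup might be wired up. The endpoint, the query parameters, and the "address" keys are based on my reading of the Nominatim documentation and should be treated as assumptions; the helper names are made up. The actual HTTP call is left out, and the public instance has a usage policy (identifying User-Agent, low request rate) that you would need to follow.

```python
import urllib.parse

# Assumed public endpoint; a self-hosted Nominatim would use its own host.
NOMINATIM_REVERSE = "https://nominatim.openstreetmap.org/reverse"

def reverse_url(lat, lon):
    """Build a reverse-geocoding URL for a latitude/longitude pair."""
    query = urllib.parse.urlencode({"format": "jsonv2", "lat": lat, "lon": lon})
    return f"{NOMINATIM_REVERSE}?{query}"

def city_and_zip(payload):
    """Extract (city, postcode) from a decoded JSON response.

    Smaller places may be tagged as town or village instead of city.
    """
    address = payload.get("address", {})
    city = next((address[k] for k in ("city", "town", "village") if k in address), None)
    return city, address.get("postcode")

# Hand-written sample in the assumed response shape, not real API output.
sample = {"address": {"city": "Palo Alto", "postcode": "94301", "country_code": "us"}}
print(reverse_url(37.4419, -122.143))
print(city_and_zip(sample))
```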

Resources