User input parsing - city / state / zipcode / country - parsing

I'm looking for advice on parsing input from a user in multiple combinations of City / State / Zip Code / Country.
A common example would be what Google maps does.
Some examples of input would be:
"City, State, Country"
"City, Country"
"City, Zip Code, Country"
"City, State, Zip Code"
"Zip Code"
What would be an efficient and correct way to parse this input from a user?
If you are aware of any example implementations please share :)

The first step would be to break up the text into individual tokens using spaces or commas as the delimiting characters. For scalability, you can then hand each token to a thread or server (if using a MapReduce-like architecture) to figure out what each token is. For instance:
If we have numbers in the pattern, then it's probably a zip code.
Is the item in the list of known states?
Countries are also fairly easy to handle like states, there's a limited number.
What order are the tokens in compared to the common ways of writing an address? Most input will probably follow the local post office custom for address formats.
Once you have the individual token results, you can glue the parts back together to get a full address. Where the result is ambiguous, you can ask the user what they really meant (like Google Maps does) and add that information to a learned list.
The easiest way to add that support to an application, assuming you're not trying to build a map system, is to query Google or Yahoo and ask them to parse the data for you.

I am myself very fascinated with how Google handles that. I do not remember seeing anything similar anywhere else.
I believe you split the input string into words using various delimiters: space, comma, semicolon, etc. That gives you several candidate combinations. For each combination, you take each word and match it against a country, city, town, and postal code database. Then you define a metric to evaluate the group match result for each combination. There should also be cross rules: for example, if the postal code does not match well, but country, city, and town match well and together refer to a valid address, the metric still yields a high mark.
It is certainly difficult and not an evening coding exercise. It also requires strong computational resources - shared hosting would probably crack under just 10 requests, but a data center could serve it well.
Not sure if there is an example implementation. Many geographical services are offered on a paid basis. Something as sophisticated as Google Maps would likely cost a fortune.
Correct me if I'm wrong.
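The scoring idea above can be sketched roughly as follows. This is only a minimal illustration in Python: the lookup sets, the three delimiter strategies, and the "fraction of recognized tokens" metric are all assumptions made for the example, and the cross rules mentioned above are not modeled.

```python
import re

# Tiny illustrative lookup tables (assumptions); a real system would
# query a country / city / postal code database instead.
COUNTRIES = {"usa", "united states", "canada"}
STATES = {"ca", "california", "ny"}
CITIES = {"san francisco", "new york", "toronto"}
POSTAL = re.compile(r"^\d{5}(-\d{4})?$")


def candidate_splits(text):
    """Yield one token list per delimiter strategy (comma, semicolon, whitespace)."""
    for pattern in (r"\s*,\s*", r"\s*;\s*", r"\s+"):
        yield [t for t in re.split(pattern, text.strip()) if t]


def score(tokens):
    """Return the fraction of tokens that match any known table."""
    if not tokens:
        return 0.0
    hits = 0
    for token in tokens:
        key = token.lower()
        if key in COUNTRIES or key in STATES or key in CITIES or POSTAL.match(key):
            hits += 1
    return hits / len(tokens)


def best_split(text):
    """Pick the candidate tokenization with the highest score."""
    return max(candidate_splits(text), key=score)
```

With this toy metric, splitting on commas wins for an input like "San Francisco, CA, USA", because splitting on whitespace breaks the city name apart and leaves commas attached to the tokens.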

I found a simple PHP implementation
http://www.eotz.com/2008/07/parsing-location-string-php/
Yahoo seems to have a webservice that offers the functionality (sort of)
http://developer.yahoo.com/geo/placemaker/
Openstreetmap seems to offer the same search functionality on its homepage
http://www.openstreetmap.org/

Assuming you're only dealing with those four fields (City Zip State Country), there are finite values for all fields except for City, and even that I guess if you have a big city list is also finite. So just split each field by comma then check against each field list.
Assuming we're talking about US addresses:
Zip is the most obvious, so check for that first.
State has 50x2 options (California or CA), so check that next.
Country has ~190x2 options, depending on how encompassing you want to be (US, United States, USA).
Whatever is left over is probably your City.
As far as efficiency goes, it might make sense to check a handful of 'standard' formats first, like Dan suggests.
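Here is a minimal sketch of that elimination order in Python. The state and country tables are deliberately truncated, the ZIP regex covers only 5-digit and ZIP+4 forms, and the function name `parse_location` is made up for illustration.

```python
import re

# Assumed, truncated lookup tables; fill them from a real list.
US_ZIP = re.compile(r"^\d{5}(?:-\d{4})?$")
STATES = {"ca": "CA", "california": "CA", "tx": "TX", "texas": "TX"}
COUNTRIES = {"us": "US", "usa": "US", "united states": "US"}


def parse_location(text):
    """Split on commas, then classify each field: zip, state, country, else city."""
    result = {"city": None, "state": None, "zip": None, "country": None}
    leftovers = []
    for field in (part.strip() for part in text.split(",")):
        if not field:
            continue
        key = field.lower()
        if result["zip"] is None and US_ZIP.match(field):
            result["zip"] = field
        elif result["state"] is None and key in STATES:
            result["state"] = STATES[key]
        elif result["country"] is None and key in COUNTRIES:
            result["country"] = COUNTRIES[key]
        else:
            leftovers.append(field)
    if leftovers:
        # Whatever was not recognized is assumed to be the city.
        result["city"] = " ".join(leftovers)
    return result
```

Note that this simple order cannot resolve ambiguous tokens on its own: "CA" would be read as California here even if the user meant Canada.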

Related

How do I check whether a given string is a valid geographical location or not?

I have a list of strings (noun phrases) and I want to filter out all valid geographical locations from them. Most of these (unwanted location names) are country or city or state names. What would be a way to do this? Is there any open-source lookup table available which contains all country, states, cities of the world?
Example desired output:
TREC4: false, Vienna: true, Ministry: false, IBM: false, Montreal: true, Singapore: true
Unlike this post: Verify user input location string is a valid geographic location?
I have a high number of strings like these (~0.7 million) so google geolocation API is probably not an option for me.
You can use the GeoPlanet data by Yahoo, or the GeoNames data from geonames.org.
Here is a link to the GeoPlanet TSV file containing 5 million geographical places of the world:
https://developer.yahoo.com/geo/geoplanet/data/
Moreover, the GeoPlanet data gives you the type (city, country, suburb, etc.) of each geographical place, along with a unique ID.
https://developer.yahoo.com/geo/geoplanet/guide/concepts.html
You can do a lowercase, sanitized (e.g. with special characters and other anomalies removed) match of your needle string against the names present in this data.
If you do not want full file scans, it helps to first load this data into a fast lookup database such as MongoDB or Redis.
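A minimal Python sketch of the normalized match, using an in-memory dictionary in place of MongoDB or Redis. The `(name, type)` row shape is an assumption for illustration and does not reflect the actual GeoPlanet column layout; the sample rows are made up.

```python
import re
import unicodedata


def normalize(name):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def build_index(rows):
    """Map each normalized place name to the set of place types seen for it."""
    index = {}
    for name, place_type in rows:
        index.setdefault(normalize(name), set()).add(place_type)
    return index


def is_location(phrase, index):
    """True if the normalized phrase is a known place name."""
    return normalize(phrase) in index


# Hypothetical sample rows standing in for the gazetteer file.
SAMPLE_ROWS = [("Vienna", "Town"), ("Montréal", "Town"), ("Singapore", "Country")]
INDEX = build_index(SAMPLE_ROWS)
```

Building the index once makes each of the ~0.7 million lookups a single dictionary access. Exact-name matching will still flag ordinary words that happen to be place names, so the place type may be worth checking too.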
I can suggest the following three options:
a) Using the Alchemy API: http://www.alchemyapi.com/
If you try their demo, places like France, Honolulu give the entity type as Country or City
b) Using TAGME: http://tagme.di.unipi.it/
TAGME connects every entity in a given text to the corresponding Wikipedia page. You can then crawl the Wikipedia page, check its infobox, and filter on that.
c) Using Wikipedia Miner: I was unable to find relevant links for this. However, this also works like TAGME.
I suggest trying all three and taking a majority vote for each instance.
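The voting step itself is small. A sketch in Python, assuming each of the three taggers has already been reduced to a True/False "is a location" verdict per string (the calls to the services themselves are not shown):

```python
from collections import Counter


def majority_vote(labels):
    """Return the label chosen by more than half of the voters, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None
```

With three voters and boolean verdicts there is always a strict majority; the `None` case only matters if a tagger fails to answer for some string.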

use of + sign in Google Adwords

I want to do a phrase match search like "warrenton home values" but I'd like to make sure home+values stay in that order but can be switched so that "home values warrenton" and "warrenton home values" will both trigger.
I thought the + sign would "chain" the two words, home+values together but after a chat with a Google rep I find myself more confused than before. What is the best way to achieve this?
Will this phrase also trigger warrenton island home values keyword search or does the use of quotes only match words found within the quotes? I need to make sure I keep warrenton in the search phrase to avoid wasting budget on triggering ads outside of the geographic area.
You can add a modifier, the plus sign on your keyboard (+), to any of the terms that are part of your broad match keyword phrase. By adding a modifier, your ads can only show when someone's search contains those modified terms, or close variations of the modified terms, in any order. The modifier won't work with phrase match or exact match keywords.
Example: +women's +hats
Example Search: hats for women
Unlike broad match keywords, modified broad match keywords won't show your ad for synonyms or related searches. For this reason, it adds an additional level of control. Using broad match modifier is a good choice if you want to increase relevancy even if it means you might get less ad traffic than broad match.
More information here: https://support.google.com/adwords/answer/2497836?hl=en&authuser=1

Address field validation for iOS / Mac

I want to create an "Add Address" view, a very basic "Street, City, Zip, Country" type of page: multiple text fields inside a table view. This is simple if you only ever added U.S addresses, but I'm not sure about how to do this the right way though, handling all international use-cases as well. Essentially:
1. How do you pick the right field label for each country? For example, for US / Australian addresses the field should be called "State"; for the UK it's called "County", and in some places it's called "Province". How do you know what the label should say (short of hard-coding the logic myself for each country)?
2. How do you validate the values for those fields? UK postal codes have a certain format, whereas in the US it's a 5-digit ZIP code. Also, in the US, there is a list of states that the user can select. How do you get that list?
I've looked into NSLocale, and can't find any way to do this. Surely there must be a good and easy way to do this?
I dug around and in the end the best thing I found was a guide on "The good international address field form", but it'll still be hard to validate it. I don't think it's done.
http://www.uxmatters.com/mt/archives/2008/06/international-address-fields-in-web-forms.php
One method could be to reverse lookup the address through mapkit.
You can try to simplify the UI by using just one text field and asking the user to enter their address in free form, and then use the CLGeocoder class to convert the string to an instance of CLPlacemark, which is a convenient container for information such as country, postal code, etc.

How can I smartly extract information from an HTML page?

I am building something that can more or less extract key information from an arbitrary web site. For example, if I crawled a McDonalds page and wanted to figure out programmatically the opening and closing time of McDonalds, what is an intelligent way to do it?
In a general case, maybe I also want to find out whether McDonalds sells chicken wings, or the address of McDonalds.
What I am thinking is that I will have a specific case for time, wings, and address and have code that is unique for each of those 3 cases.
But I am not sure how I can approach this. I have the sites crawled and the HTML and related information parsed into JSON already. My current approach is something like finding the title tag and checking if it contains keywords like address or location, etc. If the title contains those keywords, then I look through the current page and identify chunks of content that resemble an address, such as content that names cities or countries or that has the word St or Street inside.
I am wondering if there is a better approach to look for key data, and looking for a nicer starting point or bounce some ideas and whatnot. Or even if there are good articles to read about this would be great as well.
Let me know if this is unclear.
Thanks for the help.
In order to parse such HTML pages you have to have knowledge about their structure. There's no general solution for this problem. Each webpage needs its own solution. However, a good approach would be to ensure the HTML code is valid XML too and then use XPath to access elements at known positions. Maybe there's even an XPath-like solution for standard HTML (which is not always valid XML). This way you can define a set of XPaths for each page which give you the specific elements if they exist.
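As a rough illustration of the per-page XPath idea, here is a Python sketch using the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset and requires well-formed XML. The page markup, the element ids and classes, and the `SITE_XPATHS` table are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment.
PAGE = """<html><body>
<div id="store">
  <span class="address">123 Main St</span>
  <span class="hours">06:00-23:00</span>
</div>
</body></html>"""

# One set of XPaths per site; these paths are assumptions about the markup.
SITE_XPATHS = {
    "address": ".//div[@id='store']/span[@class='address']",
    "hours": ".//div[@id='store']/span[@class='hours']",
}


def extract(page_xml, xpaths):
    """Return the text at each known XPath, or None when the element is missing."""
    root = ET.fromstring(page_xml)
    found = {}
    for field, path in xpaths.items():
        node = root.find(path)
        found[field] = node.text.strip() if node is not None and node.text else None
    return found
```

Real-world HTML is often not valid XML, so `ET.fromstring` would raise a parse error on it; in that case the page has to be cleaned up or parsed with an HTML-tolerant parser first.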

Lookup telephone area code by latitude and longitude

Looking for a way to get a list of telephone area codes for a given latitude and longitude (and if necessary a given intl. code.) Note, I'm not talking about international dialing prefixes but the area codes within them.
For example, Denver Colorado is covered by the area codes 303 and 720. It's at 39.739 -104.985 and is in NANP 1. So given 39.739,-104.985,1 I'd like to get back [303,720].
Libraries, web services, DB's, or raw data that needs to be parsed into a DB, e.g., a web page of shape points, are all fine and the more global coverage the better, but just NANP 1 would be a great help.
Note I already use MaxMind and could turn the lat-lng into a fake IP and use that as the lookup key, but MaxMind claims only U.S. area codes (whether they truly mean U.S. or actually NANP I haven't tested) and seemingly only 1 per location (e.g. just 303 for Denver.) So it's a possibility, just not a great one.
UPDATE: I found some more relevant information, but no definitive solutions so I'm listing it here rather than in an answer:
I was able to find two U.S. databases http://www.area-codes.com/area-code-database.asp and http://www.nationalnanpa.com/area_codes/index.html (50% down the page, MS Access file.) The former includes lat/lng for $450 and the latter would require nearest-neighbor matching as KeithS talks about (it's probably the same DB underlying the NANPA City Query he found.)
Additionally I found information that implies Teleatlas has area code boundary maps and that ESRI includes area code shape files with copies of ArcGIS. Maponics seems to have data available: there's a Google Maps implementation of Maponics' data at http://www.usnaviguide.com/areacode.htm.
Wow. You'll definitely need some sort of pre-existing database of points. My first thought was ZIPList5 Geocode. It includes lat-long data for each active U.S. ZIP code, so you can throw this data in a DB table, index the hell out of it, and search by just about any geographic info you'd have access to. You can buy one copy for $40, with enterprise-level use for $100. Only problem is that this DB has only the "primary" area code for each ZIP code, so metro areas that have more than one (Dallas, Chicago, NYC) aren't going to show all of them.
You could try a two-pronged approach with some free data I found: for a given latitude and longitude, do a nearest-neighbors search of the data in the USGS Geographic Names Information System; it includes information on every human habitation center, and every named landmark feature, with lat/long coordinates of their centers. You now have your lat/long point mapped to the nearest town/city, ZIP code, county, and state. Now, you can compare that against this list of U.S. Area Codes, to find area codes matching any or all of the identifying information from the USGS. This is all free, and will eventually get you what you need, but you'll probably have to do some work to "massage" the two sets of data into something you can efficiently cross-reference, and/or you'll need to implement a good "search engine" that will accurately find nearest-neighbor named points, and then find area codes for locations matching the names.
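A minimal Python sketch of that two-step lookup: find the nearest named place by great-circle distance, then look up area codes by city and state. The two tables below are tiny hand-made stand-ins for the USGS data and the area code list; the Dallas row (coordinates and codes) is an illustrative assumption, and a linear scan would need replacing with a spatial index for the full dataset.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Stand-in for the named-places table: (city, state, lat, lon).
PLACES = [
    ("Denver", "CO", 39.739, -104.985),
    ("Dallas", "TX", 32.78, -96.80),
]

# Stand-in for the area code list, keyed by (city, state).
AREA_CODES = {
    ("Denver", "CO"): [303, 720],
    ("Dallas", "TX"): [214, 469, 972],
}


def area_codes_near(lat, lon):
    """Map a point to its nearest known place, then to that place's area codes."""
    city, state, _, _ = min(PLACES, key=lambda p: haversine_km(lat, lon, p[2], p[3]))
    return AREA_CODES.get((city, state), [])
```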
One more thing to look at is NANPA, which administers area code assignment to begin with. I'm sure they have a more comprehensive downloadable DB, but the only free public access I could find was this search page, which will find area codes for any city with >20k people. You could turn your lat/long data into a city and state, and then hit this search page: NANPA City Query
Here is an option:
http://geocoder.ca/39.739,-104.985?geoit=xml
<TimeZone>America/Denver</TimeZone>
<AreaCode>720,303</AreaCode>
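If you go this route, pulling the codes out of the response is straightforward. A Python sketch that parses a response shaped like the excerpt above; the wrapping `<geodata>` root element is an assumption, since only the two inner elements are shown here, and the HTTP request itself is left out.

```python
import xml.etree.ElementTree as ET

# Sample shaped like the excerpt; the <geodata> root is assumed.
SAMPLE = (
    "<geodata>"
    "<TimeZone>America/Denver</TimeZone>"
    "<AreaCode>720,303</AreaCode>"
    "</geodata>"
)


def area_codes_from_xml(xml_text):
    """Return the comma-separated AreaCode value as a sorted list of ints."""
    root = ET.fromstring(xml_text)
    raw = root.findtext(".//AreaCode", default="")
    return sorted(int(code) for code in raw.split(",") if code.strip())
```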
