How do I check whether a given string is a valid geographical location or not? - geolocation

I have a list of strings (noun phrases) and I want to filter out all valid geographical locations from them. Most of these unwanted location names are country, city, or state names. What would be a way to do this? Is there an open-source lookup table that contains all countries, states, and cities of the world?
Example desired output:
TREC4: false, Vienna: true, Ministry: false, IBM: false, Montreal: true, Singapore: true
Unlike in this post: Verify user input location string is a valid geographic location?
I have a large number of strings like these (~0.7 million), so the Google geolocation API is probably not an option for me.

You can use the GeoPlanet data from Yahoo, or the GeoNames data from geonames.org.
Here is a link to the GeoPlanet TSV file, which contains 5 million geographical places of the world:
https://developer.yahoo.com/geo/geoplanet/data/
Moreover, the GeoPlanet data gives you the type (city, country, suburb, etc.) of each geographical place, along with a unique ID.
https://developer.yahoo.com/geo/geoplanet/guide/concepts.html
You can do a lowercase, sanitized (e.g. with special characters and other anomalies removed) match of your needle string against the names present in this data.
If you do not want full file scans, it helps to first load this data into a fast lookup store such as MongoDB or Redis.
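A minimal sketch of the lowercase, sanitized match described above, in Python. The names normalize, build_index, and is_location are made up for this illustration, and SAMPLE_PLACES is a tiny stand-in for the names you would load from the GeoPlanet or GeoNames file; an in-memory set plays the role of the fast lookup store.

```python
import re
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents, and replace anything that is not a-z, 0-9, or space."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", without_accents.lower())
    # Collapse runs of whitespace and trim the ends.
    return " ".join(cleaned.split())


def build_index(place_names):
    """Build a set of normalized names for constant-time membership tests."""
    return {normalize(name) for name in place_names}


def is_location(candidate: str, index) -> bool:
    return normalize(candidate) in index


# Assumption: stand-in data. In practice, read the name column of the gazetteer file.
SAMPLE_PLACES = ["Vienna", "Montréal", "Singapore", "France"]
index = build_index(SAMPLE_PLACES)

candidates = ["TREC4", "Vienna", "Ministry", "IBM", "Montreal", "Singapore"]
results = {c: is_location(c, index) for c in candidates}
print(results)
```

An exact match on normalized names will not catch misspellings or alternate names that are missing from the data, so treat it as a first filter.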

I can suggest the following three options:
a) Using the Alchemy API: http://www.alchemyapi.com/
If you try their demo, places like France and Honolulu come back with the entity type Country or City.
b) Using TAGME: http://tagme.di.unipi.it/
TAGME links every entity in a given text to the corresponding Wikipedia page. Crawl that Wikipedia page, check its infobox, and filter on whether it describes a place.
c) Using Wikipedia Miner: I was unable to find relevant links for this. However, it also works like TAGME.
I suggest trying all three and taking a majority vote for each instance.
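The majority voting step could look roughly like this in Python. No real API is called here: the votes dictionary holds hard-coded, hypothetical verdicts standing in for whatever each of the three tools returns, and majority_vote is a name invented for this sketch.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that more than half of the voters agree on, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Assumption: placeholder verdicts (is this string a location?) from three taggers.
votes = {
    "Vienna": [True, True, False],
    "IBM": [False, False, False],
    "Ministry": [False, True, False],
}

decisions = {term: majority_vote(v) for term, v in votes.items()}
print(decisions)
```

Returning None when there is no strict majority lets you route those strings to a manual check instead of guessing.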

Related

Get random address/coordinates in a specified town

Is there any way to give the Google Maps API or a similar API a town name and have it return a random address inside the town? I was hoping to get the data as JSON so I could parse it with SwiftyJSON in Xcode and use it, but I can't seem to find any way to get the address in the first place. If coordinates would be easier to get, then those would work too, as long as the result is random and inside the town borders.
You can try to use Google Places API Web Service. It allows you to query for place information on a variety of categories, such as: establishments, prominent points of interest, geographic locations, and more. You can search for places either by proximity or a text string. A Place Search returns a list of places along with summary information about each place.
A Nearby Search lets you search for places within a specified area. You can refine your search request by supplying keywords or specifying the type of place you are searching for.
A Nearby Search request is an HTTP URL of the following form:
https://maps.googleapis.com/maps/api/place/nearbysearch/output?parameters
where output may be either json or xml.
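As a rough illustration of the URL form above, here is a Python sketch that assembles such a request string. The parameter names location, radius, and key are what I understand the Places API to expect, but they are not stated in this answer, so check them against the current documentation; nearby_search_url and YOUR_API_KEY are placeholders of mine.

```python
from urllib.parse import urlencode

BASE_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/{output}"


def nearby_search_url(lat, lng, radius_m, api_key, output="json"):
    """Build a Nearby Search URL; it only builds the string and sends nothing."""
    if output not in ("json", "xml"):
        raise ValueError("output must be 'json' or 'xml'")
    params = {
        "location": f"{lat},{lng}",  # assumed parameter name: centre of the search
        "radius": radius_m,          # assumed parameter name: radius in metres
        "key": api_key,              # assumed parameter name: your API key
    }
    return BASE_URL.format(output=output) + "?" + urlencode(params)


url = nearby_search_url(39.739, -104.985, 500, "YOUR_API_KEY")
print(url)
```

To get something random inside a town, one option is to pick a random point within the town's bounding box, run a search around it, and choose one of the returned places at random.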
And if you want either an address or coordinates, you can use Geocoding for that. Here I found a tutorial on how to use Geocoding in iOS.

Matching MapKit Places with Facebook Places

I am saving photos with city names to a server in my application. First, I get the city name from MapKit using latitude and longitude, and then I save the photo and the city name to the database. Later, when a user wants to search for a photo, he/she types the city name and I use autocomplete with Facebook Places (Graph API).
The problem is that Facebook Places and MapKit might use different names (spellings), even though both are in English. I am wondering how to query my own server, which stores MapKit city names, using Facebook Places city names.
I assume it is a little more complicated than it first seems. As long as Facebook and Apple do not use the same data source for their city names, it will be hard to find the cities whose names are not exactly the same if you use the "raw" string that you get from FB Places.
Maybe there is a much easier way to achieve it, but my first attempts would cover these options:
1. Save the geo points when you upload the photo, then find a library, API, etc. that returns latitude/longitude data for the Facebook city name, and use that to query the closest result in your database (based on photo location).
2. Suppose the user typed in a city name and you have a string value (call it rawCity) with the desired city name. Now rawCity should be contained in, or be equal to, the string that represents the city's MapKit name.
Let's assign rawCity to a new string called searchStringCity, remove the white space from it, and make the whole string lowercase (non-ASCII characters can cause some trouble too).
Now you have two strings that should be added to a dictionary (pseudocode):
rawCity = "Sample City Name"
searchStringCity = "samplecityname"
fbCityDictionary = {rawName: rawCity, searchString: searchStringCity}
Once you have fbCityDictionary, you're done with the Facebook part.
As a second step you need some database-related work, so next I would create a searchString column in my database and fill it with the "standardized" (whitespace removed, lowercased, character encoding normalized) MapKit city name.
Now you can write a query where a DB item's searchString value is equal to fbCityDictionary[searchString]. However, this won't solve your problem perfectly: it works when whitespace or a lower/uppercase letter was the problem, but there are many city names that don't have an English version, and they can differ a lot between map databases.
So you will be fine in cases like these:
Facebook version:
Sample City Name ---> samplecityname
Mapkit version:
Samplecity Name ---> samplecityname
These solutions can improve the results, but I would be curious to hear a better solution.
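Here is a small Python sketch of option 2, just to make the idea concrete. The function names and the photos_by_key dictionary are invented for this example; the dictionary stands in for the searchString column on your server, and dropping non-ASCII characters is one crude way to handle them, not the only one.

```python
import unicodedata


def search_key(city: str) -> str:
    """Standardize a city name: strip accents, lowercase, keep only letters and digits."""
    decomposed = unicodedata.normalize("NFKD", city)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


# Assumption: an in-memory stand-in for the database table with a searchString column.
photos_by_key = {}


def save_photo(mapkit_city: str, photo_id: int) -> None:
    """Store the photo under the standardized MapKit city name."""
    photos_by_key.setdefault(search_key(mapkit_city), []).append(photo_id)


def find_photos(facebook_city: str):
    """Look up photos using the standardized Facebook Places city name."""
    return photos_by_key.get(search_key(facebook_city), [])


save_photo("Samplecity Name", 1)
print(find_photos("Sample City Name"))
```

Both spellings collapse to the same key, so the lookup succeeds; names that differ by more than spacing, case, or accents still need the coordinate-based option.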

YQL documentation for the google.news search and the "geo" key

Does anyone know of documentation for the YQL Google News search? I am trying to understand the "geo" key values for the search.
This link shows an example of the search.
Thanks, and sorry for my bad English.
Cleber.
For details of the usage of the different keys on the YQL google.news table, see the source API's documentation.
In this case that can be found in the Google News Search API - JSON Developer's Guide, and the geo key is described as:
This optional argument tells the News Search system to scope search results to a particular location. With this argument present, the query argument (q) becomes optional. You must supply either a city, state, province, country, or zip code as in geo=Santa%20Barbara or geo=British%20Columbia or geo=Peru or geo=93108.
It goes on to say:
When using the geo property, please note the following:
Make sure the location you supply exists within the scope of your chosen news edition. For example, if you specify geo=Quebec for the Canadian edition of Google news, you probably won't get good results.
You can't combine geo with the topic property.
Some editions of News Search don't support the geo parameter. To test if geo works with a specific edition,
Go to that edition's landing page (for example, news.google.ca)
Click Add a Section.
In the Add a Local Section box on the right side of the page, enter a search query relevant to your desired location (for example, Quebec). You should now see a Local Results pane on the edition homepage.
If the Local Results pane is populated with results, you can use the geo parameter for that region.

Lookup telephone area code by latitude and longitude

Looking for a way to get a list of telephone area codes for a given latitude and longitude (and if necessary a given intl. code.) Note, I'm not talking about international dialing prefixes but the area codes within them.
For example, Denver Colorado is covered by the area codes 303 and 720. It's at 39.739 -104.985 and is in NANP 1. So given 39.739,-104.985,1 I'd like to get back [303,720].
Libraries, web services, DB's, or raw data that needs to be parsed into a DB, e.g., a web page of shape points, are all fine and the more global coverage the better, but just NANP 1 would be a great help.
Note I already use MaxMind and could turn the lat-lng into a fake IP and use that as the lookup key, but MaxMind claims only U.S. area codes (whether they truly mean U.S. or actually NANP I haven't tested) and seemingly only 1 per location (e.g. just 303 for Denver.) So it's a possibility, just not a great one.
UPDATE: I found some more relevant information, but no definitive solutions so I'm listing it here rather than in an answer:
I was able to find two U.S. databases http://www.area-codes.com/area-code-database.asp and http://www.nationalnanpa.com/area_codes/index.html (50% down the page, MS Access file.) The former includes lat/lng for $450 and the latter would require nearest-neighbor matching as KeithS talks about (it's probably the same DB underlying the NANPA City Query he found.)
Additionally I found information that implies Teleatlas has area code boundary maps and that ESRI includes area code shape files with copies of ArcGIS. Maponics seems to have data available: there's a Google Maps implementation of Maponics' data at http://www.usnaviguide.com/areacode.htm.
Wow. You'll definitely need some sort of pre-existing database of points. My first thought was ZIPList5 Geocode. It includes lat-long data for each active U.S. ZIP code, so you can throw this data in a DB table, index the hell out of it, and search by just about any geographic info you'd have access to. You can buy one copy for $40, with enterprise-level use for $100. Only problem is that this DB has only the "primary" area code for each ZIP code, so metro areas that have more than one (Dallas, Chicago, NYC) aren't going to show all of them.
You could try a two-pronged approach with some free data I found: for a given latitude and longitude, do a nearest-neighbors search of the data in the USGS Geographic Names Information System; it includes information on every human habitation center, and every named landmark feature, with lat/long coordinates of their centers. You now have your lat/long point mapped to the nearest town/city, ZIP code, county, and state. Now, you can compare that against this list of U.S. Area Codes, to find area codes matching any or all of the identifying information from the USGS. This is all free, and will eventually get you what you need, but you'll probably have to do some work to "massage" the two sets of data into something you can efficiently cross-reference, and/or you'll need to implement a good "search engine" that will accurately find nearest-neighbor named points, and then find area codes for locations matching the names.
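A rough Python sketch of that two-step lookup, under stated assumptions: PLACES stands in for rows from the USGS names data (coordinates here are approximate sample values), AREA_CODES stands in for the area code list keyed by city and state, and the function names are mine. It uses a plain linear scan with the haversine distance; with the full data set you would want a spatial index instead.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(min(1.0, a)))


# Assumption: illustrative sample rows (name, state, lat, lon), not real USGS records.
PLACES = [
    ("Denver", "CO", 39.7392, -104.9903),
    ("Colorado Springs", "CO", 38.8339, -104.8214),
]

# Assumption: illustrative sample of a city/state to area codes table.
AREA_CODES = {
    ("Denver", "CO"): [303, 720],
    ("Colorado Springs", "CO"): [719],
}


def nearest_place(lat, lon):
    """Step 1: find the named place closest to the query point."""
    return min(PLACES, key=lambda p: haversine_km(lat, lon, p[2], p[3]))


def area_codes_near(lat, lon):
    """Step 2: cross-reference the nearest place against the area code table."""
    name, state, _, _ = nearest_place(lat, lon)
    return AREA_CODES.get((name, state), [])


print(area_codes_near(39.739, -104.985))
```

Matching on exact city and state names is the fragile part; the "massaging" mentioned above is mostly about making those keys line up between the two data sets.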
One more thing to look at is NANPA, which administers area code assignment to begin with. I'm sure they have a more comprehensive downloadable DB, but the only free public access I could find was this search page, which will find area codes for any city with >20k people. You could turn your lat/long data into a city and state, and then hit this search page: NANPA City Query
Here is an option:
http://geocoder.ca/39.739,-104.985?geoit=xml
<TimeZone>America/Denver</TimeZone>
<AreaCode>720,303</AreaCode>

User input parsing - city / state / zipcode / country

I'm looking for advice on parsing input from a user in multiple combinations of City / State / Zip Code / Country.
A common example would be what Google maps does.
Some examples of input would be:
"City, State, Country"
"City, Country"
"City, Zip Code, Country"
"City, State, Zip Code"
"Zip Code"
What would be an efficient and correct way to parse this input from a user?
If you are aware of any example implementations please share :)
The first step would be to break up the text into individual tokens using spaces or commas as the delimiting characters. For scalability, you can then hand each token to a thread or server (if using a MapReduce-like architecture) to figure out what each token is. For instance,
If we have numbers in the pattern, then it's probably a zip code.
Is the item in the list of known states?
Countries are also fairly easy to handle like states, there's a limited number.
What order are the tokens in compared to the common ways of writing an address? Most input will probably follow the local post office custom for address formats.
Once you have the individual token results, you can glue the parts back together to get a full address. In the cases where there are questions, you can prompt the user what they really meant (like Google maps) and add that information to a learned list.
The easiest way to add that support to an application, assuming you're not trying to build a map system, is to query Google or Yahoo and ask them to parse the data for you.
I am myself very fascinated with how Google handles that. I do not remember seeing anything similar anywhere else.
I believe you try to split the input string into words using various delimiters: space, comma, semicolon, etc. That gives you several combinations. For each combination, you take each word and match it against a country, city, town, and postal code database. Then you define some metric to evaluate the group match result for each combination. There should also be cross rules, e.g. if the postal code does not match well, but country, city, and town match well and in combination refer to a valid address, then the metric yields a high mark.
It is certainly difficult and not an evening coding exercise. It also requires substantial computational resources: shared hosting would probably buckle under just 10 requests, but a data center could serve it well.
I'm not sure if there is an example implementation. Many geographical services are offered on a paid basis. Something as sophisticated as Google Maps would likely cost a fortune.
Correct me if I'm wrong.
I found a simple PHP implementation
http://www.eotz.com/2008/07/parsing-location-string-php/
Yahoo seems to have a webservice that offers the functionality (sort of)
http://developer.yahoo.com/geo/placemaker/
OpenStreetMap seems to offer the same search functionality on its homepage
http://www.openstreetmap.org/
Assuming you're only dealing with those four fields (City, Zip, State, Country), there is a finite set of values for every field except City, and even City is finite if you have a big enough city list. So just split the input by comma, then check each part against each field's list.
Assuming we're talking US addresses:
Zip is most obvious, so check for that first.
State has 50x2 options (California or CA), check that next.
Country has ~190x2 options, depending on how encompassing you want to be (US, United States, USA).
Whatever is left over is probably your City.
As far as efficiency goes, it might make sense to check a handful of 'standard' formats first, like Dan suggests.
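A minimal Python sketch of that check order (zip, then state, then country, leftover is the city), assuming US-style input split on commas. US_STATES and COUNTRIES are deliberately tiny sample tables and parse_location is a name chosen for this example; real lists would be much longer.

```python
import re

# Assumption: truncated sample tables for illustration only.
US_STATES = {"CA": "California", "CO": "Colorado", "TX": "Texas"}
COUNTRIES = {
    "US": "United States",
    "USA": "United States",
    "UNITED STATES": "United States",
    "CANADA": "Canada",
}
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")  # 5-digit ZIP, optional +4 suffix


def parse_location(text: str) -> dict:
    """Classify comma-separated tokens as zip, state, country, or city."""
    result = {"city": None, "state": None, "zip": None, "country": None}
    state_by_name = {name.upper(): name for name in US_STATES.values()}
    leftovers = []
    for token in (part.strip() for part in text.split(",")):
        if not token:
            continue
        upper = token.upper()
        if result["zip"] is None and ZIP_RE.match(token):
            result["zip"] = token
        elif result["state"] is None and (upper in US_STATES or upper in state_by_name):
            result["state"] = US_STATES.get(upper) or state_by_name[upper]
        elif result["country"] is None and upper in COUNTRIES:
            result["country"] = COUNTRIES[upper]
        else:
            leftovers.append(token)
    if leftovers:
        # Whatever could not be classified is probably the city.
        result["city"] = leftovers[0]
    return result


print(parse_location("Denver, CO, 80202, USA"))
```

This breaks on names that are both a city and a state (New York, for example), which is where token order or a check against known formats would have to come in.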

Resources