Web Service for Geo Location (Get the biggest cities within a state) - geolocation

Is there a (web) service which offers a geolocation API?
For example:
I have the German state "Baden-Württemberg", and I want to get a result listing its biggest cities (for example, ordered by population).
My problem is a little bit abstract, but I hope someone can understand it.

This is not exactly what you are looking for, but I think it is a step in the right direction if you are willing to set up your own database to query. The United Nations Statistics Division (UNSD) keeps a dataset of the largest cities (population over 100,000). You can find it at this link. Note that it does not show which state (1st-level administrative division) a city is within, just the country.
http://unstats.un.org/unsd/demographic/products/dyb/dyb2011/Table08.xls
I have created a CSV version of the data (using semicolons as delimiters) that you can use as well:
http://www.opengeocode.org/cude1.1/UN/UNSD/dyd2011-pop100k.zip
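For a quick start, here is a minimal sketch of querying that CSV once downloaded and unzipped. The column names (Country, City, Population) are assumptions, so check the file's actual header first:

import csv

# Load the semicolon-delimited extract; column names are assumed here
with open('dyd2011-pop100k.csv', encoding='utf-8') as f:
    rows = list(csv.DictReader(f, delimiter=';'))

# There is no state-level column, so filter at the country level only,
# then sort by population, largest first
german = [r for r in rows if r['Country'] == 'Germany']
german.sort(key=lambda r: int(r['Population'].replace(',', '')), reverse=True)
for r in german[:10]:
    print(r['City'], r['Population'])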
OpenGeoCode.Org is an open data project where we take national and international publicly available datasets and convert them into a common CSV format.
Andrew

Related

What is the best way to name a URL when you can do both 'unique identifier'-based and 'hierarchy'-based resource naming? What are the pros and cons of each?

Let's say we have a resource structure like the one below:
GUID                                 | Region   | Country | State           | StateDetails
a120c850-e296-4563-8fb9-31d0192aef75 | EMEA     | FR      | Normandy        | Statedetails
6f4b3ca6-c992-42dd-b1e3-8c8f8ba62886 | APAC     | AU      | New South Wales | Statedetails
d202b255-5fe1-4203-b4ad-3cc74f4f6986 | AMERICAS | US      | California      | Statedetails
...                                  | ...      | ...     | ...             | ...
Here GUID is a unique identifier for a resource record, and Region is a parent of Country, which in turn is a parent of State; each such state has one record in the table. (The region/country/state combination is unique and can act as an alternate key.)
In order to display the state-details JSON, which URL naming would be more appropriate? Are there pros and cons to each approach?
Option 1:
http://www.xyzsamplecompany.com/insertGUIDhere
eg: http://www.xyzsamplecompany.com/6f4b3ca6-c992-42dd-b1e3-8c8f8ba62886
(or)
Option 2:
http://www.xyzsamplecompany.com/region/country/state/resource
eg: http://www.xyzsamplecompany.com/EMEA/FR/Normandy/statedetails
Listing my thoughts here.
In Option 1, the link is immutable and relatively static, so the same URL will return the same data over a long period of time. A client (requestor) may be given the static link, which they can bookmark for future use. If the data changes tomorrow (say, France decides to rename Normandy to Normândy), that complexity is hidden from the requestor.
However, there is less transparency as to which resource is being inquired about, since the hierarchy is hidden.
In Option 2, the hierarchy is laid out very clearly. The complexity of arriving at the correct URL is left to the client (requestor), so they need to keep track of underlying data changes.
It is transparent to any system inspecting the resource, such as a monitoring tool or WAF. However, with this much transparency, if an unwanted third party knows the list of states, countries, and regions (which is common knowledge), there is a risk of scraping, which could prove resource-intensive.
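To make the trade-off concrete, here is a minimal sketch (Flask, with hypothetical in-memory data) serving the same record under both URL styles. Note that a rename only breaks the Option 2 route:

from flask import Flask, abort, jsonify

app = Flask(__name__)

# Toy data; the region/country/state combination is an alternate key
STATES = {
    "6f4b3ca6-c992-42dd-b1e3-8c8f8ba62886": {
        "region": "APAC", "country": "AU", "state": "New South Wales",
    },
}
ALT_KEY = {(r["region"], r["country"], r["state"]): g for g, r in STATES.items()}

# Option 1: opaque, immutable identifier
@app.route("/<uuid:guid>")
def by_guid(guid):
    record = STATES.get(str(guid))
    if record is None:
        abort(404)
    return jsonify(record)

# Option 2: transparent hierarchy; renames break old URLs unless redirected
@app.route("/<region>/<country>/<state>/statedetails")
def by_hierarchy(region, country, state):
    guid = ALT_KEY.get((region, country, state))
    if guid is None:
        abort(404)
    return jsonify(STATES[guid])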

Parsing Wikipedia countries, regions, cities

Is it possible to get a list of all Wikipedia countries, regions and cities with relations between them? I couldn't find any API appropriate for this task.
What would be the easiest way to parse all the information I need?
PS: I know that there are other data sources I could get this information from. But I am interested in Wikipedia...
[2020 update] This is now best done using the Wikidata Query Service: you can run very specific queries with a bit of SPARQL, for example "find all countries and their labels". See Wikidata Query Help.
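As a minimal sketch, that "all countries" query can be run against the public SPARQL endpoint from Python like this (the User-Agent value is a placeholder to replace with your own):

import requests

# Find all countries and their English labels
# (P31 = instance of, Q6256 = country)
SPARQL = """
SELECT ?country ?countryLabel WHERE {
  ?country wdt:P31 wd:Q6256 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "geo-example/0.1"},  # placeholder, use your own
)
for row in resp.json()["results"]["bindings"]:
    print(row["countryLabel"]["value"])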
It might be a bit tedious to get the whole graph but you can get most of the data from the experimental/non-official Wikidata Query API.
I suggest the following workflow:
Go to an instance of the kind of entities you want to work with, say Estonia (Q191), and look at its instance of (P31) properties; you will find: country, sovereign state, member of the UN, member of the EU, etc.
Use the Wikidata Query API claim command to output every entity that has the chosen P31 property. Let's try with country (Q6256):
http://wdq.wmflabs.org/api?q=claim[31:6256]
It outputs an array of numeric ids: those are your countries! (Notice that the result is still incomplete, as only 141 items are found: either countries are missing from Wikidata or, as suggested by Nemo in the comments, some countries are to be found in country (Q6256) subclasses (P279).)
You may want more than ids, though, so you can ask the official Wikidata API for entity data:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16&format=json&props=labels|claims&languages=en|fr
(here Canada (Q16) data, in JSON, with only claims and labels, in English and French. Look at the documentation to adapt the parameters to your needs)
You can query multiple entities at a time, with a limit of 50, as follows:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16|Q17|Q20|Q27|Q28|Q29|Q30|Q31|Q32|Q33|Q34|Q35|Q36|Q37|Q38|Q39|Q40|Q41|Q43|Q45|Q77|Q79|Q96|Q114&format=json&props=labels|claims&languages=en|fr
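If you script this, a small sketch of batching those wbgetentities calls could look like the following (the 50-id limit comes from the API documentation):

import requests

def get_entities(ids, languages="en|fr"):
    # Fetch labels/claims for a list of Q-ids, 50 per request (API limit)
    entities = {}
    for i in range(0, len(ids), 50):
        resp = requests.get(
            "https://www.wikidata.org/w/api.php",
            params={
                "action": "wbgetentities",
                "ids": "|".join(ids[i:i + 50]),
                "props": "labels|claims",
                "languages": languages,
                "format": "json",
            },
        )
        entities.update(resp.json()["entities"])
    return entities

countries = get_entities(["Q16", "Q17", "Q20"])  # Canada, Japan, Norway
print(countries["Q16"]["labels"]["en"]["value"])  # Canada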
From every country's data, you could look for entities registered as administrative subdivisions (P150) and repeat the process on those new entities.
Alternatively, you can get the whole tree of administrative subdivisions with the tree command. For instance, for France (Q142) that would be http://wdq.wmflabs.org/api?q=tree[142][150] Tadaaa, 36994 items! But that is much harder to refine, given the different kinds of subdivision you can encounter from one country to another. And avoid running this kind of query from a browser; it might crash.
Now you just have to find cities by country, by refining this last query with the claim command and the appropriate subclass (P279) of municipality (Q15284) (all available here): for France, that's commune (Q484170), so your request looks like
http://wdq.wmflabs.org/api?q=tree[142][150] AND claim[31:484170]
then repeat for all the countries: have fun!
You should go with Wikidata and/or DBpedia.
Personally, I'd start with Wikidata, as it directly uses MediaWiki, with the same API, so you can use similar code. I would use pywikibot to get started. That way you can still request pages from Wikipedia where that makes sense (e.g. list pages or categories).
Here's a nice overview of ways to access Wikidata

Complex fuzzy string matching in iOS

I'm writing an iOS application that pulls events from a public Google calendar, pulls out the free-form "Location" field, and drops a pin on a map corresponding to the given location. I want to make the app as flexible as possible using some kind of string search or fuzzy matching algorithms, but I'm not sure where to begin.
There are several things a calendar moderator may enter into the Location field:
A building name and room number (e.g. Foo Hall Room 123)
A building abbreviation and room number (e.g. FOO 123)
A shorthand room or location name (e.g. Foo)
Currently, I have an SQLite database with a single table; each row stores a latitude, longitude, full building name (Foo Hall), and standardized building abbreviation (FOO).
I want to take the moderator's free-form string and obtain the correct coordinates from the database (if present).
I've tried using LIKE '%FOO%' and similar patterns, as well as Levenshtein distance, but I run into issues, for instance when the actual building name is "Example Foo and Bar Building" and the location entered by the moderator is "Example Bar Building".
The three options I've considered are...
Force the moderator to enter a standardized abbreviation or building name. This could be a tedious process for the calendar moderators, so I'm trying to avoid it if possible.
Do a crude substring search that checks if the entered string is contained anywhere in the database string. This is what my university does on their website, but it obviously isn't very flexible.
Implement a more complex fuzzy string matching algorithm that provides maximum flexibility but will take an order of magnitude more time to implement. If the right one already exists, that would be the ideal solution!!
Which of these options (if any) seems the best? Is there a better alternative that I haven't thought of? Is there a library that does what I need and I just haven't found it yet?
Thanks in advance for any help!
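For what it's worth, here is a small, language-agnostic sketch (shown in Python) of the token-overlap idea behind option 3. It handles the "Example Bar Building" case, where substring search and whole-string Levenshtein both fail:

import re

STOPWORDS = {"and", "the", "of"}

def tokens(s):
    # Lowercase word tokens, minus filler words; digits (room numbers)
    # are ignored
    return set(re.findall(r"[a-z]+", s.lower())) - STOPWORDS

def score(query, candidate):
    # Fraction of query tokens found in the candidate building name
    q = tokens(query)
    return len(q & tokens(candidate)) / len(q) if q else 0.0

buildings = ["Example Foo and Bar Building", "Foo Hall", "Baz Center"]
best = max(buildings, key=lambda b: score("Example Bar Building", b))
print(best)  # Example Foo and Bar Building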

What's the best way to lookup the US county a US city resides in?

I'm looking for the best/easiest way to programmatically grab the name of the US county a given US city resides in. It doesn't seem there's a straightforward API available for such a (seemingly simple) task?
You can download a freely-available database of county/city/zip code info such as this one:
http://www.unitedstateszipcodes.org/zip-code-database/ (no need to register or pay)
Import it, whole or in part, into a local, persistent data store (such as a database) and query it whenever you need to look up a city's county.
Note: County info has disappeared from the originally-linked .csv file since this answer was posted.
This link no longer contains county information: http://federalgovernmentzipcodes.us/free-zipcode-database.csv
1) Cities span counties.
2) ZIP codes span both cities and counties, and not even along the same lines.
Any solution that uses ZIP codes as an intermediary is going to corrupt your data (and no, "ZIP+4" won't usually fix it). You will find that a city-to-zip-to-county data map (#2) produces more city-to-county matches than the more accurate model (#1); those extra matches are all bad.
What you're looking for is free census data. The Federal Information Processing Standards (FIPS) dataset you need is called "2010 ANSI Codes for Places": https://www.census.gov/geographies/reference-files/time-series/geo/name-lookup-tables.2010.html
Census "places" are the "cities" for our question. These files map "places" to one or more county.
It will not be easy to use geospatial functions for this task because of the odd polygon shapes of counties and the point locations of cities.
Your best bet is to reference a database of cities and their respective counties, though I don't know where you could find one.
Maybe Texas publishes one?
CommonDataHub doesn't contain this information.
Here is a bit of code to programmatically grab the name of a US county given a single US city/state using the Google Maps API. This code is slow/inefficient and does not have any error handling. However, it has worked reliably for me to match counties with a list of ~1,000 cities.
#Set up googlemaps API
import googlemaps
google_maps = googlemaps.Client(key='API_KEY_GOES_HERE')

#String of city/state
address_string = 'Atlanta, GA'

#Geocode
location = google_maps.geocode(address_string)

#Loop through the first result in `location` and find the address component
#tagged 'administrative_area_level_2', which is the county level
target_string = 'administrative_area_level_2'
for item in location[0]['address_components']:
    if target_string in item['types']:  #Match target_string
        county_name = item['long_name']  #Or 'short_name'
        break  #Break out once county is located
else:
    #Some locations might not contain the expected information
    pass
This produces:
>>> county_name
Fulton County
Caveats:
- the code will break if google_maps.geocode() is not passed a valid address
- certain addresses will not return data corresponding to 'administrative_area_level_2'
- this does not solve the problem of US cities that span multiple counties; instead, I think the API simply returns the county associated with the single latitude/longitude pair for address_string
The quickest and least invasive way might be to use a JSON/XML request to a free geolocation API (easily found on Google). That way you don't need to create or host your own database.

What is the best approach for interpreting a text input for geocoding purposes?

Consider the following site:
http://maps.google.com
It has a main text input where the user can type businesses, countries, provinces, cities, addresses, and zip codes. I wonder what the best way to implement a search like this is. I suspect Google Maps uses a full-text search with all kinds of data in the same table, and that it may have a parser which classifies the input (i.e. between numeric, like zip codes and coordinates, and textual, like businesses and addresses).
With the data spread across many tables and systems, a parser is essential. The parser could be built from regular expressions, or with AI tools like artificial neural networks and genetic algorithms.
Which approach would you recommend?
It might be best to aggregate the data from all of your tables into a search index. Lucene is a free search engine, similar to how Google's search engine works (inverted index), and it should allow you to search by any of those values or any combination of them with relative ease.
http://lucene.apache.org/java/docs/
Lucene comes with its own query language (again, very similar to Google's or any other Internet search site's syntax). The only drawback of using something like Lucene is that you need to build its index. You wouldn't be querying your database directly (which could get very complicated... inverted indexes are pretty much designed for what you're trying to do), so you need to periodically gather up new information from your database and add it to your index. It might also be necessary to rebuild the index to remove stale data.
With Lucene, you get a pretty flexible query syntax that most people are familiar with (because pretty much everyone searches the Internet), it performs very well, and it is not terribly complicated. By using Lucene, you avoid the cost of regular expressions (which are not the most performant text-searching mechanism), and you don't have to write your own parser. It should be a win-win, aside from a little learning curve to build a Lucene index generator and to figure out how to query that index.
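To make the inverted-index idea concrete, here is a toy sketch in Python; the documents and ranking are purely illustrative, and a real engine like Lucene adds tokenization, scoring, and persistence on top:

from collections import defaultdict

# Rows aggregated from several tables into one searchable "document" each
documents = {
    1: "business Argos Manchester M1 1AA",
    2: "city Manchester England",
    3: "zipcode 90210 Beverly Hills California",
}

# Inverted index: term -> set of document ids containing it
inverted = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

def search(query):
    # Rank documents by how many query terms they contain
    hits = defaultdict(int)
    for term in query.lower().split():
        for doc_id in inverted.get(term, ()):
            hits[doc_id] += 1
    return sorted(hits, key=hits.get, reverse=True)

print(search("Argos Manchester"))  # [1, 2]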
I'd keep the data in one database. If the data got too big, or I knew it would be huge, I'd assign an id to each business, address, etc., and then have other tables which reference this data.
Regular Expressions would only be necessary if the user could define what they want to search for:
business: Argos
But then what happens if they want an Argos in Manchester (sorry, I'm English)? Maybe then you get the location of the user based on their IP. But what happens if they say:
business: Argos Scotland
Now you don't know whether the company name has two words, or whether there is a location next to it. All of this has to be taken into consideration.
P.S. Sorry if that made no sense.
You will need to preprocess the query before doing a full-text search on it. If you are using a GIS database, then you will already have columns like city, area code, country, etc. Convert your query into tokens separated on spaces, commas, or both. Then hit the individual columns to look for matches. This way you will know which part of the query is the city, which is the area code, and so on.
You could also try some naive approximation approaches; for example, six consecutive digits will probably be an area code. Look for common words like "road", "restaurant", "street", etc., which will be part of many queries, and then use some approximation to figure out what the user is looking for. Hope this helps.
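A tiny sketch of that preprocessing step; the patterns are illustrative only, since real ZIP and area-code rules vary by country:

import re

# Naive token patterns, checked in order; everything else is free text
PATTERNS = [
    ("zipcode", re.compile(r"^\d{5}(-\d{4})?$")),
    ("areacode", re.compile(r"^\d{6}$")),
    ("keyword", re.compile(r"^(road|street|restaurant)$", re.IGNORECASE)),
]

def classify(query):
    # Split on spaces and/or commas, then label each token
    labels = []
    for token in re.split(r"[,\s]+", query.strip()):
        kind = next((name for name, pat in PATTERNS if pat.match(token)), "text")
        labels.append((token, kind))
    return labels

print(classify("Argos, Manchester 90210"))
# [('Argos', 'text'), ('Manchester', 'text'), ('90210', 'zipcode')]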
