What's the best way to look up the US county a US city resides in? - geolocation

I'm looking for the best/easiest way to programmatically grab the name of the US county a given US city resides in. There doesn't seem to be a straightforward API available for such a (seemingly simple) task.

You can download a freely-available database of county/city/zip code info such as this one:
http://www.unitedstateszipcodes.org/zip-code-database/ (no need to register or pay)
Import it, in whole or in part, into a local persistent data store (such as a database) and query it whenever you need to look up a city's county.
Note: County info has disappeared from the originally-linked .csv file since this answer was posted.
This link no longer contains county information: http://federalgovernmentzipcodes.us/free-zipcode-database.csv
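A minimal sketch of the import-and-query idea, assuming the file you download still has a county column. The column names (primary_city, state, county), the table name, and the sample rows are assumptions for illustration; adjust them to whatever the actual CSV contains.

```python
import csv
import io
import sqlite3

def load_zip_rows(conn, csv_file):
    """Load city/state/county rows from a CSV file object into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS places (city TEXT, state TEXT, county TEXT)"
    )
    reader = csv.DictReader(csv_file)
    rows = [
        (r["primary_city"].strip().lower(), r["state"].strip().upper(), r["county"].strip())
        for r in reader
        if r.get("county")  # skip rows that carry no county value
    ]
    conn.executemany("INSERT INTO places VALUES (?, ?, ?)", rows)
    conn.commit()

def counties_for_city(conn, city, state):
    """Return the sorted, de-duplicated counties recorded for a city."""
    cur = conn.execute(
        "SELECT DISTINCT county FROM places WHERE city = ? AND state = ? ORDER BY county",
        (city.strip().lower(), state.strip().upper()),
    )
    return [row[0] for row in cur]

# Tiny illustrative sample (assumed column names); a real run would pass
# an opened CSV file instead, e.g. open("zip_code_database.csv", newline="")
sample = io.StringIO(
    "zip,primary_city,state,county\n"
    "30303,Atlanta,GA,Fulton County\n"
    "30307,Atlanta,GA,DeKalb County\n"
    "30002,Avondale Estates,GA,DeKalb County\n"
)
conn = sqlite3.connect(":memory:")
load_zip_rows(conn, sample)
print(counties_for_city(conn, "Atlanta", "GA"))
```

The lookup returns a list rather than a single name, since the same city can appear under more than one county in such a file.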

1) Cities span counties
2) Zips span both cities and counties, not even on the same lines
Any solution that uses zip as an intermediary is going to corrupt your data (and no, "zip+4" won't usually fix it). You will find that a city-to-zip-to-county data map (#2) produces a larger number of city-to-county matches than the more accurate direct city-to-county model (#1). The extra matches are all bad matches.
What you're looking for is free census data. The Federal Information Processing Standards (FIPS) dataset you need is called "2010 ANSI Codes for Places": https://www.census.gov/geographies/reference-files/time-series/geo/name-lookup-tables.2010.html
Census "places" are the "cities" for our question. These files map each "place" to one or more counties.
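Here is a rough sketch of reading such a places file into a place-to-counties map. It assumes the file is pipe-delimited with a header row containing STATE, PLACENAME and COUNTY columns, and that a multi-county place lists its counties comma-separated in the COUNTY field; check the actual download before relying on that. The sample rows are illustrative, not copied from the census file.

```python
import csv
import io
from collections import defaultdict

def load_place_counties(lines):
    """Map (state, place name) to the list of counties the place lies in."""
    reader = csv.DictReader(lines, delimiter="|")
    mapping = defaultdict(list)
    for row in reader:
        key = (row["STATE"], row["PLACENAME"])
        # Assumption: multi-county places list their counties comma-separated
        for county in row["COUNTY"].split(","):
            county = county.strip()
            if county and county not in mapping[key]:
                mapping[key].append(county)
    return dict(mapping)

# Illustrative rows in the assumed layout (the second row is made up)
sample = io.StringIO(
    "STATE|STATEFP|PLACEFP|PLACENAME|TYPE|FUNCSTAT|COUNTY\n"
    "GA|13|04000|Atlanta city|Incorporated Place|A|DeKalb County, Fulton County\n"
    "GA|13|99999|Exampleville city|Incorporated Place|A|Example County\n"
)
places = load_place_counties(sample)
print(places[("GA", "Atlanta city")])
```

Keeping the value as a list makes the "cities span counties" case explicit instead of silently picking one county.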

It will not be easy to use geospatial functions for this task because of the odd polygon shapes of counties and the point locations of cities.
Your best bet is to reference a database of cities and their respective counties, though I don't know where you could find one.
Maybe Texas publishes one?
CommonDataHub doesn't contain this information.

Here is a bit of code to programmatically grab the name of a US county given a single US city/state using the Google Maps API. This code is slow/inefficient and does not have any error handling. However, it has worked reliably for me to match counties with a list of ~1,000 cities.
# Set up the Google Maps API client
import googlemaps
google_maps = googlemaps.Client(key='API_KEY_GOES_HERE')
# String of city/state
address_string = 'Atlanta, GA'
# Geocode
location = google_maps.geocode(address_string)
# Loop through the first dictionary within `location` and find the address component that contains the 'administrative_area_level_2' designator, which is the county level
target_string = 'administrative_area_level_2'
for item in location[0]['address_components']:
    if target_string in item['types']:  # Match target_string
        county_name = item['long_name']  # Or 'short_name'
        break  # Break out once county is located
else:
    # Some locations might not contain the expected information
    county_name = None
This produces:
>>> county_name
Fulton County
Caveats:
code will break if google_maps.geocode() is not passed a valid address
certain addresses will not return data corresponding to 'administrative_area_level_2'
this does not solve the problem of US cities that span multiple counties. Instead, I think the API simply returns the county associated with the single latitude/longitude associated with address_string
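To soften the first two caveats, the county lookup can be pulled into a small helper that returns None instead of raising. This is only a sketch: extract_county is a hypothetical name, and fake_results is a hand-written stand-in for the list google_maps.geocode() returns, so the example runs without an API key.

```python
def extract_county(geocode_results):
    """Return the county name from a geocode result list, or None."""
    if not geocode_results:  # empty list: the address could not be geocoded
        return None
    for component in geocode_results[0].get("address_components", []):
        if "administrative_area_level_2" in component.get("types", []):
            return component.get("long_name")
    return None  # no county-level component in the first result

# Hand-written stand-in for the geocode() response (an assumption for the demo)
fake_results = [{
    "address_components": [
        {"long_name": "Atlanta", "short_name": "Atlanta",
         "types": ["locality", "political"]},
        {"long_name": "Fulton County", "short_name": "Fulton County",
         "types": ["administrative_area_level_2", "political"]},
        {"long_name": "Georgia", "short_name": "GA",
         "types": ["administrative_area_level_1", "political"]},
    ]
}]
print(extract_county(fake_results))
print(extract_county([]))
```

In real use you would call extract_county(google_maps.geocode(address_string)). It still returns a single county per address, so the multi-county caveat remains.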

The quickest and least invasive way might be to send a JSON/XML request to a free geolocation API (easily found with a web search). That way you don't need to create or host your own database.

Related

How to standardize city names inserted by user

I need to write a small ETL pipeline because I need to move some data from a source database to a target database (a data warehouse) to perform some analysis on the data.
Among those data, I need to clean and conform the names of cities. Cities are entered manually by international users, so a single city can have multiple names (for example London or Londra).
My source database contains not only big cities but also small villages.
If I do not standardize city names, our analysis could be nonsensical.
What are the best practices for standardizing cities in my target database? Do you have any ideas or suggestions I could follow?
Thank you
The only reliable way to do this is to use commercial address validation software - preferably in your source system when the data is being created but it could be integrated into your data pipeline processes.
Assuming you can't afford/justify the use of commercial software, the only other solution is to create your own translation table i.e. a table that holds the values that are entered and what value you want them to be translated to.
While you can build this table based on historic data, there will always be new values that are not in the table, so you would need a process to identify these, add the new records to your translation data and then fix the affected records. You would also need to accept that there will be un-cleansed data in your warehouse for a period of time after each data load.
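A minimal sketch of the translation-table approach. The function names and the sample mappings are hypothetical; in a real pipeline the table would live in the database rather than in a Python dict.

```python
def conform_city(raw, translations):
    """Look up the canonical city name for a raw user-entered value."""
    key = " ".join(raw.split()).casefold()  # collapse whitespace, ignore case
    return translations.get(key)

def conform_batch(raw_names, translations):
    """Return (cleaned values, sorted list of values missing from the table)."""
    cleaned, unmatched = [], set()
    for raw in raw_names:
        canonical = conform_city(raw, translations)
        if canonical is None:
            unmatched.add(raw)   # queue for manual review
            cleaned.append(raw)  # loaded un-cleansed until the table is updated
        else:
            cleaned.append(canonical)
    return cleaned, sorted(unmatched)

# Hypothetical translation table: entered value (normalized) -> canonical name
translations = {"london": "London", "londra": "London", "londres": "London"}
cleaned, todo = conform_batch(["Londra", " LONDON ", "Lundun"], translations)
print(cleaned)
print(todo)
```

The second return value is the list that feeds the "identify new values, extend the table, fix affected records" process.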

Parsing Wikipedia countries, regions, cities

Is it possible to get a list of all Wikipedia countries, regions and cities with relations between them? I couldn't find any API appropriate for this task.
What would be the easiest way to parse all the information I need?
PS: I know that there are other data sources I can get this information from. But I am interested in Wikipedia...
[2020 update] This is now best done using the Wikidata Query Service: you can run very specific queries with a bit of SPARQL. Example: Find all countries and their label. See Wikidata Query Help.
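As a rough illustration of the Query Service route, the sketch below builds a request URL for a "countries and their labels" query and parses the standard SPARQL JSON result shape. The helper names are hypothetical, the query is one possible formulation rather than the linked example, and the payload is hand-written so nothing is fetched over the network.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Entities that are an instance of (P31) country (Q6256), with English labels
COUNTRIES_QUERY = """
SELECT ?country ?countryLabel WHERE {
  ?country wdt:P31 wd:Q6256 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

def build_wdqs_url(query):
    """Build a GET URL asking the query service for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query.strip(), "format": "json"})

def labels_from_response(payload):
    """Pull (entity URI, label) pairs out of a SPARQL JSON result."""
    return [
        (b["country"]["value"], b["countryLabel"]["value"])
        for b in payload["results"]["bindings"]
    ]

# Hand-written payload in the SPARQL JSON results layout (for the demo only)
sample_payload = {"results": {"bindings": [
    {"country": {"type": "uri", "value": "http://www.wikidata.org/entity/Q16"},
     "countryLabel": {"type": "literal", "value": "Canada"}},
]}}
url = build_wdqs_url(COUNTRIES_QUERY)
print(url[:60])
print(labels_from_response(sample_payload))
```

Fetching the URL (for example with urllib.request, sending a descriptive User-Agent header) and passing the decoded JSON to labels_from_response would complete the round trip.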
It might be a bit tedious to get the whole graph but you can get most of the data from the experimental/non-official Wikidata Query API.
I suggest the following workflow:
Go to an instance of the kind of entities you want to work with, say Estonia (Q191) and look for its instance of (P31) properties, you will find: country, sovereign state, member of the UN, member of the EU, etc.
Use the Wikidata Query API claim command to output every entity that has the chosen P31 property. Let's try with country (Q6256):
http://wdq.wmflabs.org/api?q=claim[31:6256]
It outputs an array of numeric ids: that's your countries! (notice that the result is still incomplete as there are only 141 items found: either countries are missing from Wikidata, or, as suggested by Nemo in comments, some countries are to be found in country (Q6256) subclasses(P279))
You may want more than ids though, so you can ask Wikidata Official API for entities data:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16&format=json&props=labels|claims&languages=en|fr
(here Canada(Q16) data, in json, with only claims and labels data, in English and French. Look at the documentation to adapt parameters to your needs)
You can query multiple entities at a time, with a limit of 50, as follows:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16|Q17|Q20|Q27|Q28|Q29|Q30|Q31|Q32|Q33|Q34|Q35|Q36|Q37|Q38|Q39|Q40|Q41|Q43|Q45|Q77|Q79|Q96|Q114&format=json&props=labels|claims&languages=en|fr
From each country's data, you could look for entities registered as administrative subdivisions (P150) and repeat on those new entities.
Alternatively, you can get the whole tree of administrative subdivisions with the tree command. For instance, for France (Q142) that would be http://wdq.wmflabs.org/api?q=tree[142][150] Tadaaa, 36994 items! But that's way harder to refine given the different kinds of subdivision you can encounter from one country to another. And avoid doing this kind of query from a browser, it might crash.
You now just have to find cities by countries by refining this last query with the claim command, and the appropriate sub-class(P279) of municipality(Q15284) entity (all available here): for France, that's commune (Q484170), so your request looks like
http://wdq.wmflabs.org/api?q=tree[142][150] AND claim[31:484170]
then repeat for all the countries: have fun!
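Since the claim query returns bare numeric ids and wbgetentities accepts at most 50 ids per call, the glue between the two steps is mostly batching. A small sketch, where wbgetentities_urls is a hypothetical helper name and the parameters mirror the example URLs in the steps above:

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"

def wbgetentities_urls(numeric_ids, batch_size=50, languages=("en", "fr")):
    """Turn numeric ids (e.g. 16 for Q16) into wbgetentities URLs, 50 ids per URL."""
    urls = []
    for start in range(0, len(numeric_ids), batch_size):
        batch = numeric_ids[start:start + batch_size]
        params = {
            "action": "wbgetentities",
            "ids": "|".join(f"Q{n}" for n in batch),
            "format": "json",
            "props": "labels|claims",
            "languages": "|".join(languages),
        }
        urls.append(API + "?" + urlencode(params))  # "|" is percent-encoded as %7C
    return urls

# 120 made-up ids stand in for the array returned by the claim query
urls = wbgetentities_urls(list(range(1, 121)))
print(len(urls))
print(urls[0][:80])
```

Each URL can then be fetched in turn, and the administrative subdivisions (P150) found in the claims fed back through the same helper.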
You should go with Wikidata and/or dbpedia.
Personally I'd start with Wikidata as it's directly using MediaWiki, with the same API, so you can use similar code. I would use pywikibot to get started. That way you can still request pages from Wikipedia where that makes sense (e.g. list pages or categories).
Here's a nice overview of ways to access Wikidata

Web Service for Geo Location (Get biggest Cities within an state)

Is there a (web) service which offers a Geo Location API?
For example:
I have the German state "Baden-Württemberg", and I want to get its biggest cities as a result (for example, ordered by population).
My problem is a little bit abstract, but I hope someone can understand it.
This is not exactly what you are looking for, but I think it is a step in the right direction if you are willing to set up your own database to query. The United Nations Statistics Division (UNSD) keeps a dataset of cities with a population over 100,000. You can find it at the link below. Note that it does not show which state (first-level administrative division) a city is in, just the country.
http://unstats.un.org/unsd/demographic/products/dyb/dyb2011/Table08.xls
I have created a CSV version of the data (using semicolons as delimiters) that you can use as well:
http://www.opengeocode.org/cude1.1/UN/UNSD/dyd2011-pop100k.zip
OpenGeoCode.Org is an open data project where we take national and international publicly available datasets and convert them into a common CSV format.
Andrew
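For what it's worth, querying such a semicolon-delimited file for the largest cities of a country could look roughly like this. The column names (country, city, population) and the sample rows are made up for illustration; the real file's header will differ.

```python
import csv
import io

def largest_cities(csv_file, country, limit=5):
    """Return up to `limit` (city, population) pairs for a country, largest first."""
    reader = csv.DictReader(csv_file, delimiter=";")
    rows = [
        r for r in reader
        if r["country"] == country and r["population"].strip().isdigit()
    ]
    rows.sort(key=lambda r: int(r["population"]), reverse=True)
    return [(r["city"], int(r["population"])) for r in rows[:limit]]

# Made-up sample data in the assumed layout
sample = io.StringIO(
    "country;city;population\n"
    "Exampleland;Alpha;250000\n"
    "Exampleland;Beta;900000\n"
    "Otherland;Gamma;400000\n"
    "Exampleland;Delta;n/a\n"
)
top = largest_cities(sample, "Exampleland", limit=2)
print(top)
```

Because the dataset only carries the country, filtering by state (such as Baden-Württemberg) would need an extra city-to-state mapping from another source.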

How is the country code of a venue generated?

I'm developing an app using the FS API and I need to geolocate a venue. Nothing particularly difficult, just "US" or "Non-US".
I thought of using the country code (CC field) which comes with every venue object, but I'm not sure how this country code is calculated:
a) Is it something you infer from the lat and lon of the venue, and therefore something FS calculates directly with geospatial queries?
b) Is it something the user enters manually (not only the CC but the country itself), and therefore something that can be missing or misspelled / entered incorrectly?
Cheers,
Alfredo
The country information is built up from a variety of sources, including geolocation and user input. The input is validated though, so there should not be any invalid country codes. You can expect that the country code will be an accurate representation of where the venue is located.
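On the client side, the US / Non-US check then reduces to a comparison on that field. A small sketch, assuming the venue has been parsed into a dict with the code under location.cc (the exact nesting is an assumption here; adapt it to the response you actually receive):

```python
def is_us_venue(venue):
    """Return True/False for US vs non-US, or None if no country code is present."""
    cc = (venue.get("location") or {}).get("cc")
    if not cc:
        return None  # defensive: treat a missing code as "unknown"
    return cc.strip().upper() == "US"

# Hand-written venue dicts standing in for parsed API responses
print(is_us_venue({"name": "Example Cafe", "location": {"cc": "US"}}))
print(is_us_venue({"name": "Example Bar", "location": {"cc": "IT"}}))
print(is_us_venue({"name": "No location"}))
```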

Complex fuzzy string matching in iOS

I'm writing an iOS application that pulls events from a public Google calendar, pulls out the free-form "Location" field, and drops a pin on a map corresponding to the given location. I want to make the app as flexible as possible using some kind of string search or fuzzy matching algorithms, but I'm not sure where to begin.
There are several things a calendar moderator may enter into the Location field:
A building name and room number (e.g. Foo Hall Room 123)
A building abbreviation and room number (e.g. FOO 123)
A shorthand room or location name (e.g. Foo)
Currently, I have a sqlite database composed of one table, each row storing a latitude, longitude, full building name (Foo Hall), and standardized building abbreviation (FOO).
I want to take the moderator's free-form string and obtain the correct coordinates from the database (if present).
I've tried using LIKE '%FOO%' and similar patterns, as well as Levenshtein Distance, but I run into issues, for instance if the actual building name is "Example Foo and Bar Building" and the location entered by moderator is "Example Bar Building".
The three options I've considered are...
Force the moderator to enter in a standardized abbreviation or building name. This could potentially be a tedious process for the calendar moderators, so I'm trying to avoid this if possible.
Do a crude substring search that checks if the entered string is contained anywhere in the database string. This is what my university does on their website, but it obviously isn't very flexible.
Implement a more complex fuzzy string matching algorithm that provides maximum flexibility but will take an order of magnitude more time to implement. If the right one already exists, that would be the ideal solution!!
Which of these options (if any) seems the best? Is there a better alternative that I haven't thought of? Is there a library that does what I need and I just haven't found it yet?
Thanks in advance for any help!
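One possible middle ground between the second and third options is token-based matching: split both strings into words, drop generic words and room numbers, and score by how many of the entered words appear (approximately) in the building name. Below is a Python sketch of the logic only, to be ported to the app's language; the stopword list, the threshold, the scoring and the sample rows are all assumptions, not a tuned solution.

```python
import difflib

# Generic words that carry little information about which building is meant
STOPWORDS = {"and", "the", "of", "room", "building", "hall"}

def tokens(text):
    """Lowercase word set without stopwords or pure numbers (room numbers)."""
    words = "".join(c.lower() if c.isalnum() else " " for c in text).split()
    return {w for w in words if w not in STOPWORDS and not w.isdigit()}

def score(query, name, abbreviation):
    """2.0 for an exact abbreviation hit, else fraction of query words found in name."""
    query_tokens = tokens(query)
    if not query_tokens:
        return 0.0
    if abbreviation.lower() in query_tokens:
        return 2.0
    name_tokens = list(tokens(name))
    hits = sum(
        1 for t in query_tokens
        if difflib.get_close_matches(t, name_tokens, n=1, cutoff=0.8)
    )
    return hits / len(query_tokens)

def best_building(query, buildings, threshold=0.6):
    """Return the best-scoring (name, abbreviation, lat, lon) row, or None."""
    best = max(buildings, key=lambda b: score(query, b[0], b[1]), default=None)
    if best is None or score(query, best[0], best[1]) < threshold:
        return None
    return best

# Stand-in for rows read from the sqlite table (coordinates are placeholders)
buildings = [
    ("Foo Hall", "FOO", 0.0, 0.0),
    ("Example Foo and Bar Building", "EFB", 1.0, 1.0),
]
print(best_building("Example Bar Building", buildings))
print(best_building("FOO 123", buildings))
print(best_building("Quux 9", buildings))
```

Scoring on word overlap rather than whole-string edit distance is what lets "Example Bar Building" match "Example Foo and Bar Building", the case that trips up LIKE and plain Levenshtein.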
