How to match postal codes to city names in SPSS?

I have a dataset (dataset1) with one variable, "postcode", which contains several postal codes. The order of these postal codes is important and must not be changed.
In my second dataset (dataset2) I have two variables: "postcode", which contains all postal codes of the country, and "city", which contains the name of the city that has each postal code.
My goal: I need to match postal codes from dataset1 with names of the cities from dataset2.
dataset1:
postcode
5226
3071
1821
dataset2:
postcode city
5226 Leiden
3930 Amsterdam
1821 Almere
1921 Echt
3071 Den Bosch
This is the result that I want:
city
Leiden
Den Bosch
Almere

Yes, merging files is the way to go. Make sure the postal code variable has the same name in both files or data sets. Open the one without the city names. Click on Data>Merge Files>Add Variables, identify the file with the cities included, click Continue, and you should see that it'll merge based on key values, which is what you want.
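The key-based merge described above amounts to a table lookup: each postcode in dataset1 is looked up in dataset2, and the row order of dataset1 is left untouched. A minimal Python sketch of that logic, using the sample values from the question (the variable names are illustrative, not SPSS identifiers):

```python
# Sample data from the question; variable names are illustrative only.
dataset1 = [5226, 3071, 1821]
dataset2 = [
    (5226, "Leiden"),
    (3930, "Amsterdam"),
    (1821, "Almere"),
    (1921, "Echt"),
    (3071, "Den Bosch"),
]

# Build a postcode -> city lookup table from dataset2.
city_by_postcode = dict(dataset2)

# Walk dataset1 in its original order; unknown postcodes map to None.
cities = [city_by_postcode.get(postcode) for postcode in dataset1]
print(cities)  # ['Leiden', 'Den Bosch', 'Almere']
```

Because the lookup is driven by dataset1, the order of its rows never changes, which is the property the question asks for.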


Cleaning data in SPSS with name misspellings

I have a 5M records dataset in this basic format:
FName LName UniqueID DOB
John Smith 987678 10/08/1976
John Smith 987678 10/08/1976
Mary Martin 567834 2/08/1980
John Smit 987678 10/08/1976
Mary Martin 768987 2/08/1980
The DOB is always unique, but I have cases where:
Same ID, different name spellings or Different ID, same name
I got as far as making SPSS recognize that John Smit and John Smith with the same DOB are the same people, and I used aggregate to show how many times a spelling was used near the name (John Smith, 10; John Smit 5).
Case 1:
What I would like to do is to loop through all the records for the people identified to be the same person, and get the most common spelling of the person's name and use that as their standard name.
Case 2:
If I have multiple IDs for the same person, take the lowest one and make that the standard.
I am comfortable using basic syntax to clean my data, but this is the only thing that I'm stuck on.
If UniqueID really is a unique identifier for individuals in the population, and you want to find variations in name spelling within each ID and assign the modal (most frequent) spelling, then something like this would work:
STRING FirstLastName (A99).
COMPUTE FirstLastName = CONCAT(FName," ", LName).
AGGREGATE OUTFILE= * MODE=ADDVARIABLES /BREAK=UniqueID FirstLastName /Count=N.
AGGREGATE OUTFILE= * MODE=ADDVARIABLES /BREAK=UniqueID /MaxCount=MAX(Count).
IF (Count<>MaxCount) FirstLastName = "".
AGGREGATE OUTFILE= * MODE=ADDVARIABLES OVERWRITE=YES /BREAK=UniqueID /FirstLastName=MAX(FirstLastName).
You could then also overwrite the FName and LName fields, but that requires more assumptions, for example about whether FName or LName can contain space characters.
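The same two steps (modal spelling per ID, then lowest ID per person) can be sketched outside SPSS. The Python below is only an illustration on the sample rows from the question; it assumes that a standardised name plus DOB identifies a person, which is an assumption not confirmed by the question, and all variable names are made up for the example.

```python
from collections import Counter, defaultdict

# Sample rows from the question: (FName, LName, UniqueID, DOB).
records = [
    ("John", "Smith", 987678, "10/08/1976"),
    ("John", "Smith", 987678, "10/08/1976"),
    ("Mary", "Martin", 567834, "2/08/1980"),
    ("John", "Smit", 987678, "10/08/1976"),
    ("Mary", "Martin", 768987, "2/08/1980"),
]

# Case 1: count every full-name spelling within each UniqueID.
spellings = defaultdict(Counter)
for fname, lname, uid, dob in records:
    spellings[uid][f"{fname} {lname}"] += 1

# The modal spelling becomes the standard name; ties go to the spelling seen first.
standard_name = {uid: counts.most_common(1)[0][0] for uid, counts in spellings.items()}

# Case 2: ASSUMPTION - standard name + DOB identifies one person.
# Keep the lowest UniqueID seen for each such person.
lowest_id = {}
for fname, lname, uid, dob in records:
    person = (standard_name[uid], dob)
    lowest_id[person] = min(uid, lowest_id.get(person, uid))

print(standard_name[987678])                     # John Smith
print(lowest_id[("Mary Martin", "2/08/1980")])   # 567834
```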

How can I sort an array into a dictionary?

Let's say I have an array which contains multiple country names, such as "Australia, Denmark, United Kingdom, Austria, Australia, Denmark". Some of the country names appear twice.
How can I sort them to form a dictionary based on country names? The key would be the country name and the element would be the country.
If I have two countries in my array that are the same, the key would be the country and the elements would be those two countries.
I need it so that if I add another country, it will make a key for that country without my having to specify keys beforehand.
Each country needs to be under a key of its own name, independent of how many times the country occurs in the array.
I think I've worked out a basic algorithm for it but I can't seem to put it into practice.
While enumerating over the array
check to see if a key in the dictionary matches the current string
If it does, add the string to the dictionary under the matching key
If it doesn't, create a key and place the string under the key.
Is this algorithm correct or at least a step in the right direction?
Thanks for the help.
EDIT:
We have an array which contains the country names "Australia, Denmark, United Kingdom, Austria, Australia, Denmark"
I need to organise this into a dictionary based on countries so as we have two of the country Denmark in the array I need to sort it so it looks like this:
Denmark: "Denmark", "Denmark"
The key is the country name and the element is the string.
United Kingdom only occurs once so that part of the dictionary will look like this:
United Kingdom: "United Kingdom"
I hope I've made more sense.
Not sure if this is what you meant. It's not very clear.
var dict = [String: [String]]()
let countries = ["Holland", "England", "France", "Belgium", "England"]
// Append each country to the array stored under its own name.
for country in countries {
    dict[country] = (dict[country] ?? []) + [country]
}
// Dictionary iteration order is not guaranteed.
for (key, value) in dict {
    print("KEY: \(key) & VALUE: \(value)")
}
Output (the order of the keys may vary):
KEY: England & VALUE: ["England", "England"]
KEY: Belgium & VALUE: ["Belgium"]
KEY: Holland & VALUE: ["Holland"]
KEY: France & VALUE: ["France"]
EDIT:
Simplified based on Martin R's link in his comment.
The simplest way is to just loop over the array and check if the key exists in the dictionary. Make each value of the dictionary an NSMutableArray.
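The loop-and-check approach described in the question and in the answers can be sketched in a few lines. This Python version is only an illustration of the algorithm on the question's sample data, not a translation of either answer:

```python
countries = ["Australia", "Denmark", "United Kingdom", "Austria", "Australia", "Denmark"]

grouped = {}
for country in countries:
    # setdefault creates an empty list the first time a key is seen,
    # so no keys have to be declared beforehand.
    grouped.setdefault(country, []).append(country)

for key, values in grouped.items():
    print(f"{key}: {values}")
```

Adding a new country to the array needs no other change: its key is created on first sight.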

How to get areas/regions for a country and cities for this area/region?

We made a simple query which gets some cities:
SELECT * FROM `allCountries` WHERE name='Moscow' and `country_code` = 'RU'
Here is the result of this query:
For example, for another city we get a result with 4-7 rows.
How to get all areas/regions for a country and then get all cities for this area/region?
P.S.: Please be careful. We are not interested in an API site and database fetch. Thanks!
Background
In Geonames you have feature_classes and feature_codes, which discriminate the location type. You can find a detailed description of the codes on the Geonames website. As in your snapshot, P.PPLC means "City (populated place) which is capital of a political entity" and S.HTL means "building (spot) hotel".
Also, every geoname has properties that identify its location in the "hierarchy" inside a country; these properties are country_code, admin1_code, admin2_code, admin3_code, admin4_code. Note that not all properties are used for every given geoname, since this depends on the political organization of a country.
Find all cities inside an administrative level
To find all cities inside an area (i.e. an administrative level), you must first look up the geoname for that admin level, so that you have the admin codes needed to filter the city query.
To find an admin level, you must first execute a query like:
SELECT *
FROM `allCountries`
WHERE `country_code` = 'RU'
AND `feature_class`='A'
AND `feature_code`='ADM1'
Note that this query returns only first-level admin divisions (feature_code='ADM1'), but you can find admin levels of any depth by changing it to:
SELECT *
FROM `allCountries`
WHERE `country_code` = 'RU'
AND `feature_class`='A'
AND `feature_code` LIKE 'ADM_'
Now select one record from this result set and use it to search for the cities, using the "hierarchy" codes of that level. You should use something like (mutatis mutandis):
SELECT *
FROM `allCountries`
WHERE `country_code` = "RU"
AND `feature_class`='P'
AND `feature_code` LIKE 'PPL%'
AND `admin1_code`="<admin1>"
AND `admin2_code`="<admin2>"
AND `admin3_code`="<admin3>"
AND `admin4_code`="<admin4>"
Beware of NULL admin codes, which you need to strip out from the SQL (the whole "AND ..." clause).
Of course, you can do the original "Moscow" search inside this filtered set.
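Dropping the clauses for NULL admin codes is easy to get wrong by hand, so it can help to build the query programmatically. The sketch below is an assumption-laden illustration: the function name `build_city_query` is hypothetical, the table name defaults to the `allCountries` table from the question, the `%s` placeholders follow the style used by common Python MySQL drivers, and "XX" stands in for a real admin1 code.

```python
ADMIN_COLUMNS = ("admin1_code", "admin2_code", "admin3_code", "admin4_code")

def build_city_query(country_code, admin_codes, table="allCountries"):
    """Return (sql, params) selecting populated places in one admin area.

    admin_codes maps column names such as "admin1_code" to a value or None.
    Columns that are missing or None are left out of the WHERE clause.
    """
    clauses = ["`country_code` = %s", "`feature_class` = %s", "`feature_code` LIKE %s"]
    params = [country_code, "P", "PPL%"]
    for column in ADMIN_COLUMNS:
        value = admin_codes.get(column)
        if value is not None:
            clauses.append(f"`{column}` = %s")
            params.append(value)
    sql = f"SELECT * FROM `{table}` WHERE " + " AND ".join(clauses)
    return sql, params

# "XX" is a placeholder, not a real Geonames admin code.
sql, params = build_city_query("RU", {"admin1_code": "XX", "admin2_code": None})
print(sql)
print(params)
```

Passing the values as parameters rather than pasting them into the SQL string also avoids quoting problems in the admin codes.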
The answer for your question is pretty long, but this code snippet may help you a little bit. These queries obtain all hierarchy information about given geonameid (it's plpython inside postgres).
get_geoname = plpy.prepare("SELECT geonameid, asciiname, country, admin1, admin2 FROM all_countries where geonameid=$1",
["integer"])
get_country_name = plpy.prepare("SELECT name as country from country_info where code = upper($1)", ["varchar"])
get_admin1 = plpy.prepare("SELECT asciiname, name FROM admin1 where code = $1", ["text"])
get_admin2 = plpy.prepare("SELECT asciiname, name FROM admin2 where code = $1", ["text"])

Search four fields in YQL geo.places

We are currently using YQL to query geo data for towns and counties in the UK. At the moment, we can use the following query to find all towns named Boston:
select * from geo.places where text="boston" and placeTypeName="Town"
The issue is that we would like to specify the county and country to generate more specific results. I have tried the following query, but it returns 0 results:
select * from geo.places where (text="boston" and placeTypeName="Town") and (text="lincolnshire" and placeTypeName="County")
How can I query 3 field types to return the results I need? Essentially, we would like to query the following fields:
text and placeTypeName="Town"
text and placeTypeName="County"
text and placeTypeName="Country"
This may be an option:
https://developer.yahoo.com/blogs/ydnsevenblog/solving-location-based-services-needs-yahoo-other-technology-7952.html
As it mentions:
Turning text into a location
You can also turn a text (name) into a location using the following code:
yqlgeo.get('paris,fr', function(o) {
  alert(o.place.name + ' (' +
    o.place.centroid.latitude + ',' +
    o.place.centroid.longitude +
    ')');
});
This wrapper call uses our Placemaker Service under the hood and automatically disambiguates for you. This means that Paris is Paris, France, and not Paris Hilton; London is London, England, and not Jack London.

What are the names of administrativeArea, subAdministrativeArea, etc called outside of the United States?

I am working on a geographically aware high score server, and would like to be able to list scores like "First place in Dutchess County" or "Third place in the State of New York". I can reverse geocode the user's location and get a placemark that lists AdministrativeArea, etc.
The reverse geocoder used by iOS and Google would return "Dutchess" and "New York" for the above examples, so I need to supply "County" and "State" for the United States.
However, the game is global, so I need to know the names of each geographic organization level in other English-speaking countries.
So, in the United States, Google / iOS placemark levels would be described as follows:
AdministrativeArea = "State"
SubAdministrativeArea = "County" (or "Parish" in Louisiana)
Locality = "City" or "Town"
Sublocality = (I'm calling this "Neighborhood")
PostalCode = "Zip Code"
What are all of these levels called in other English-speaking countries? (Canada, UK, Australia, New Zealand, Singapore, etc). If there's a resource that lists all these, I would love to know it. I think I may just not know what to search for on the web.
I'm not entirely sure of the effort/value ratio of this exercise. It could get rather difficult, especially with unitary authority areas in England.
In most of the United Kingdom,
country is "United Kingdom" [ie the country name]
administrative_area_level_1 could be "England" or "Scotland" etc [ie the country name]
administrative_area_level_2 might be "East Sussex" [county]
locality might be "Hailsham" [town]
postal_code is a postcode
However in London,
administrative_area_level_2 is "London" which isn't a county
administrative_area_level_2 might be "Greater London" too
administrative_area_level_3 might be "London Borough of Lewisham" [yay! Borough makes sense]
locality is "London" which isn't a locality
In unitary authority areas in England,
administrative_area_level_2 might be "Medway"
Unitary authorities replace county council and borough/district councils, but they are located within a "ceremonial county" which is what most people will use ordinarily. Places in Medway Council's area are in Kent. Unfortunately these county names aren't returned by the geocoder. Some counties (eg Berkshire) were abolished completely and replaced entirely by unitary authorities. However the old county name (Kent, or Berkshire) is the right name to use.
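One way to keep this manageable is a small lookup table of level labels with per-region overrides, falling back to the bare place name wherever no label is known (as with unitary authorities such as Medway). The Python sketch below is only an illustration: the table contains just the US labels listed in the question, and the function name, key names and structure are assumptions made for the example.

```python
# Labels taken from the question's US list; other countries are deliberately left out.
LEVEL_LABELS = {
    "US": {"administrative_area": "State", "sub_administrative_area": "County"},
}

# Per-region overrides, e.g. Louisiana uses "Parish" instead of "County".
OVERRIDES = {
    ("US", "Louisiana"): {"sub_administrative_area": "Parish"},
}

def describe(country, region, level, name):
    """Return the place name with its level label, or the bare name if none is known."""
    labels = dict(LEVEL_LABELS.get(country, {}))
    labels.update(OVERRIDES.get((country, region), {}))
    label = labels.get(level)
    return f"{name} {label}" if label else name

print(describe("US", "New York", "sub_administrative_area", "Dutchess"))  # Dutchess County
print(describe("US", "Louisiana", "sub_administrative_area", "Orleans"))  # Orleans Parish
print(describe("GB", "England", "sub_administrative_area", "Medway"))     # Medway
```

With the fallback, a score line such as "First place in Medway" still reads naturally even when no label has been entered for that country or level.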
