Ruby gem for parsing country + city from a string? - ruby-on-rails

I am trying to determine the cities, states, and countries for Twitter users.
The location field returns the location, but I need to parse it and store this data in a structured format.
For instance, if the location in a user's bio is "London", it should store the city as London, and the country as UK. If it's "Albany, NY", it should store the city as Albany, the state as NY, and the country as USA. If it's just "NY", it should store the state as NY, and country as USA. If it's "India", it should store the country as India (with no city or state). Obviously, if the location is nonsense like "outer space", it should return nothing.
Is there a gem out there that does something like this? If not, is there any way I can do this intelligently by leveraging some third-party service?

I faced the same problem when geolocating Twitter locations. The best free service I found is OpenStreetMap.
It is really easy to use and the response is JSON.
Try it yourself: http://nominatim.openstreetmap.org/search?q=london&format=json&addressdetails=1&accept-language=en
Here is the first element that matches "london":
{
"place_id": "97592906",
"licence": "Data \u00a9 OpenStreetMap contributors, ODbL 1.0. http:\/\/www.openstreetmap.org\/copyright",
"osm_type": "relation",
"osm_id": "65606",
"boundingbox": [
"51.2867584228516",
"51.6918754577637",
"-0.510375142097473",
"0.334015518426895"
],
"lat": "51.5072759",
"lon": "-0.1276597",
"display_name": "London, Greater London, England, United Kingdom",
"class": "place",
"type": "city",
"importance": 0.9654895765402,
"icon": "http:\/\/nominatim.openstreetmap.org\/images\/mapicons\/poi_place_city.p.20.png",
"address": {
"city": "London",
"county": "London",
"state_district": "Greater London",
"state": "England",
"country": "United Kingdom",
"country_code": "gb"
}
}
As you can see, the address field contains all the information you need.
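Here is a minimal Python sketch of turning one such result into the city/state/country structure from the question. The function name `extract_location` and the fallback keys `town` and `village` are my own choices for illustration; the address keys vary from place to place, so treat the key list as a starting point. It works on an already-decoded result, so no network request is made.

```python
def extract_location(result):
    """Pull city, state and country out of one decoded Nominatim result."""
    address = result.get("address", {})
    # Smaller places may be reported as "town" or "village" instead of "city".
    city = next((address[k] for k in ("city", "town", "village") if k in address), None)
    return {
        "city": city,
        "state": address.get("state"),
        "country": address.get("country"),
        "country_code": address.get("country_code"),
    }


# The "address" object from the London response above.
london = {
    "address": {
        "city": "London",
        "county": "London",
        "state_district": "Greater London",
        "state": "England",
        "country": "United Kingdom",
        "country_code": "gb",
    }
}

print(extract_location(london))
```

Fields that are missing from the address come back as `None`, which covers inputs such as "India" where there is no city or state to store.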

Related

Python Named Entities Recognition finding a specific entity

I am currently working on an NLP project and am trying to use NLTK to recognize a PERSON name. But the problem is more challenging than just finding the part of speech.
input = "Hello world, the case is complex. John Due, the plaintiff in the case has hired attorney John Smith for the case."
So, the challenge is: I just want to get the attorney's name as the return from the whole document, not the other person, so "John Smith", part-of-speech: PERSON, occupation: attorney. The return could look like this, or just "John Smith".
{
"name": "John Smith",
"type": "PERSON",
"occupation": "attorney"
}
I have tried NLTK part-of-speech tagging and also the Google Cloud Natural Language API, but they only helped me detect the PERSON name. How can I detect whether that person is an attorney? Please guide me to the right approach. Do I have to train on my own data or corpus to detect "attorney"? I have thousands of court document txt files.
The thing with pre-trained machine learning models is that there is not much room for flexibility in what you can achieve. Tools such as Google Cloud Natural Language offer some really interesting functionality, but you cannot make them do other work for you. In such a case, you would need to train your own models, or try a different approach using tools such as TensorFlow, which require considerable expertise to obtain decent results.
However, regarding your precise use case, you can use the analyzeEntities method to find named entities (common nouns and proper names). It turns out that, if the word attorney is next to the name of the person who is actually an attorney (as in "I am John, and my attorney James is working on my case." or your example "Hello world, the case is complex. John Due, the plaintiff in the case has hired attorney John Smith for the case."), it will bind those two entities together.
You can test that using the API Explorer. For the following request:
{
"document": {
"content": "I am John, and my attorney James is working on my case.",
"type": "PLAIN_TEXT"
},
"encodingType": "UTF8"
}
Some of the resulting entities are:
{
"name": "James",
"type": "PERSON",
"metadata": {
},
"salience": 0.5714066,
"mentions": [
{
"text": {
"content": "attorney",
"beginOffset": 18
},
"type": "COMMON"
},
{
"text": {
"content": "James",
"beginOffset": 27
},
"type": "PROPER"
}
]
},
{
"name": "John",
"type": "PERSON",
"metadata": {
},
"salience": 0.23953272,
"mentions": [
{
"text": {
"content": "John",
"beginOffset": 5
},
"type": "PROPER"
}
]
}
In this case, you will be able to parse the JSON response and see that James is (correctly) connected to the attorney noun, while John is not. However, as per some tests I have been running, this behavior seems to be only reproducible if the word attorney is next to one of the names you are trying to identify.
I hope this helps, but if your needs are more complex, you will not be able to cover them with an out-of-the-box solution such as the Natural Language API.
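As a rough sketch of that parsing step, the Python below walks the entities list and keeps the PERSON entities that also carry a COMMON mention equal to the role you care about. The helper name `people_with_role` is hypothetical, and the input is a trimmed copy of the entities shown above, not a live API call.

```python
def people_with_role(entities, role):
    """Return names of PERSON entities that have a COMMON mention equal to role."""
    names = []
    for entity in entities:
        if entity.get("type") != "PERSON":
            continue
        for mention in entity.get("mentions", []):
            text = mention.get("text", {}).get("content", "")
            if mention.get("type") == "COMMON" and text.lower() == role.lower():
                names.append(entity["name"])
                break
    return names


# Trimmed copy of the entities from the response above.
entities = [
    {
        "name": "James",
        "type": "PERSON",
        "mentions": [
            {"text": {"content": "attorney", "beginOffset": 18}, "type": "COMMON"},
            {"text": {"content": "James", "beginOffset": 27}, "type": "PROPER"},
        ],
    },
    {
        "name": "John",
        "type": "PERSON",
        "mentions": [
            {"text": {"content": "John", "beginOffset": 5}, "type": "PROPER"},
        ],
    },
]

print(people_with_role(entities, "attorney"))  # ['James']
```

This only works when the API has already bound the common noun to the person, so it inherits the limitation described above.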

(Cloudant) Creating a view to combine two document types

Let's say that I'm making a Cloudant database to store all the service records for my fleet of cars (I'm not, but the problem is pretty much the same.) To do this, I have two types of records:
Cars:
{
"type": "Car",
"_id": "VIN 1",
"plateNumber": "ecto-1",
"plateState": "NY",
"make": "Cadillac",
"model": "Professional Chassis",
"year": 1959
}
{
"type": "Car",
"_id": "VIN 2",
"plateNumber": "mntclmbr",
"plateState": "VT",
"make": "Jeep",
"model": "Wrangler",
"year": 2016
}
And service records:
{
"type": "ServiceRecord",
"_id": "service1",
"carServiced": "VIN 1",
"date": [1984, 6, 8],
"item": "Cleaning (Goo)",
"cost": 300
}
{
"type": "ServiceRecord",
"_id": "service2",
"carServiced": "VIN 1",
"date": [1984, 6, 9],
"item": "Cleaning (Marshmellow)",
"cost": 800
}
{
"type": "ServiceRecord",
"_id": "service3",
"carServiced": "VIN 2",
"date": [2016, 4, 2],
"item": "Alignment",
"cost": 150
}
There are a couple of things to note about how this works:
The VIN of a car will never change, so it is used as the document _id.
The service records for a car should not be lost if the car is registered in a new state, or with a new plate number.
Due to the volume of cars, and how often they need repairs, it's not reasonable to edit a car's document if a service record needs to be added, removed, or changed.
Currently, I have a couple views to look up information.
First, I've got a map from license plate to VIN:
function(doc){
  // Index every car by its [state, plate number] pair.
  if (doc.type == "Car"){
    emit([doc.plateState, doc.plateNumber], doc._id);
  }
}
// Results in:
// ["NY", "ecto-1"] -> "VIN 1"
// ["VT", "mntclmbr"] -> "VIN 2"
Second, I've got a map from all the cars' VINs to the service records:
function(doc){
  // Index every service record by the VIN of the car it belongs to.
  if (doc.type == "ServiceRecord"){
    emit(doc.carServiced, doc);
  }
}
// Results in:
// "VIN 1" -> {"_id": "service1", ...}
// "VIN 1" -> {"_id": "service2", ...}
// "VIN 2" -> {"_id": "service3", ...}
Finally, I've got a map from all the cars' VINs and service dates to the specific service that happened on that date:
function(doc){
  if (doc.type == "ServiceRecord"){
    // Key is [VIN, year, month, day]; date is stored as [year, month, day].
    var key = [doc.carServiced, doc.date[0], doc.date[1], doc.date[2]];
    emit(key, doc);
  }
}
// Results in:
// ["VIN 1", 1984, 6, 8] -> {"_id": "service1", ...}
// ["VIN 1", 1984, 6, 9] -> {"_id": "service2", ...}
// ["VIN 2", 2016, 4, 2] -> {"_id": "service3", ...}
With these three maps, I can find three different things:
The VIN of any car by its license plate.
The service records of any car by its VIN.
The service records of any car by its VIN for any particular year, month, or day.
However, I can't find all the service records of a car by its license plate. (At least not in one step.) To do that, I would need a map like this:
["NY", "ecto-1"] -> {"_id": "service1", ...}
["NY", "ecto-1"] -> {"_id": "service2", ...}
["VT", "mntclmbr"] -> {"_id": "service3", ...}
And to make it even MORE complicated, I'd like to be able to look up service records by license plate AND date, with a map like this:
["NY", "ecto-1", 1984, 6, 8] -> {"_id": "service1", ...}
["NY", "ecto-1", 1984, 6, 9] -> {"_id": "service2", ...}
["VT", "mntclmbr", 2016, 4, 2] -> {"_id": "service3", ...}
Unfortunately, I don't know how to generate maps like these because the key requires information from two documents. I can only get plate information from Car documents and I can only get service information (including the document _id for the value of emit) from ServiceRecord documents.
So far, my only thought is to do two queries: one to get the VIN from the plate info, and another to get the service records from the VIN. They'll be fast queries, so it's not a huge problem, but I feel like there's a better way.
Anyone know what that better way might be?
(Bonus: The two-query method does not allow for finding all service records by state in an efficient way. The last map I describe would be able to do that. So bonus internet-points for anyone who can describe a solution that provides that functionality as well.)
Edit: Another question was suggested as a possible duplicate. It is definitely a similar problem; however, the solutions provided do not solve this issue. Specifically, the top solution suggests storing a document's position within the tree. In this case, that would be something like "index": [State, Number, Year, Month, Day] in a ServiceRecord document. However, we can't do that because the plate information can easily change.
Hopefully you are still around. The gist of the answer is: in CouchDB, when you feel the need to do joins, you are doing something wrong 99% of the time. What you need to do is keep all the information you need in one document.
You need to get into the habit of thinking about how you are going to query your data when you design what to save. You will find that replacing the "relational normalization" habit with this habit is healthy.
What you can do here is save the license plate number in the service record document. Don't be afraid to denormalize. A service record should therefore look like this:
{
"type": "ServiceRecord",
"_id": "service3",
"carServiced": "VIN 2",
"carPlateNumber": "mntclmbr",
"date": [2016, 4, 2],
"item": "Alignment",
"cost": 150
}
And you can easily do whatever you want from here. That being said, the architect in me can smell that you are likely to invent new ways to query this data every month. For this reason, I'd personally prefer to store the whole car document within the service record:
{
"type": "ServiceRecord",
"_id": "service3",
"carServiced": {
"type": "Car",
"_id": "VIN 2",
"plateNumber": "mntclmbr",
"plateState": "VT",
"make": "Jeep",
"model": "Wrangler",
"year": 2016
},
"date": [2016, 4, 2],
"item": "Alignment",
"cost": 150
}
This is absolutely fine, especially since a service record is a snapshot in time and you do not need to worry about updating the information. I actually find that this is one of the scenarios where CouchDB particularly shines, as storing a snapshot is basically a free lunch (as opposed to managing a cars_snapshot table in a relational system). We tend to forget it, but very often (especially as far as sales are concerned) we are interested in snapshots, not up-to-date relational data (what was the customer name at the time he bought, what was the tax rate at the time he bought, etc.). But relational systems put us in the "most up to date by default" habit because snapshot management involves a significant overhead there.
The bottom line is that this kind of denormalization is absolutely fine in CouchDB. This is the intended usage, and it won't come back to bite you down the road. As CouchDB puts it: just relax ;)
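To illustrate what the embedded-car layout makes possible, here is a small Python simulation of a view keyed by [state, plate, year, month, day]. A real Cloudant view would be a JavaScript map function; `map_by_plate_and_date` and `query_prefix` are hypothetical names that only mimic the map step and a key-range query in memory, and the records are trimmed versions of the question's data.

```python
def map_by_plate_and_date(doc):
    """Mimic a view's map function: yield (key, value) pairs for one document."""
    if doc.get("type") == "ServiceRecord":
        car = doc["carServiced"]  # embedded car snapshot
        yield [car["plateState"], car["plateNumber"], *doc["date"]], doc["_id"]


def query_prefix(rows, prefix):
    """Return values whose key starts with prefix, in key order."""
    return [value for key, value in sorted(rows) if key[:len(prefix)] == prefix]


records = [
    {"type": "ServiceRecord", "_id": "service1", "date": [1984, 6, 8],
     "carServiced": {"_id": "VIN 1", "plateState": "NY", "plateNumber": "ecto-1"}},
    {"type": "ServiceRecord", "_id": "service2", "date": [1984, 6, 9],
     "carServiced": {"_id": "VIN 1", "plateState": "NY", "plateNumber": "ecto-1"}},
    {"type": "ServiceRecord", "_id": "service3", "date": [2016, 4, 2],
     "carServiced": {"_id": "VIN 2", "plateState": "VT", "plateNumber": "mntclmbr"}},
]

rows = [row for doc in records for row in map_by_plate_and_date(doc)]

print(query_prefix(rows, ["NY", "ecto-1"]))  # ['service1', 'service2']
print(query_prefix(rows, ["VT"]))            # ['service3']
```

The same key serves the lookup by plate, by plate and date, and by state alone, because each is just a shorter prefix of the key. Against a real view, the prefix match would typically be expressed as a startkey/endkey range.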
It sounds like chained MapReduce could provide your solution?
https://examples.cloudant.com/sales/_design/sales/index.html

Finding IP Address Ranges for a U.S. City

I'm trying to find some IP addresses for testing IP geolocation functionality on a website. Does anyone know of a good way to find static IP addresses for certain cities (e.g. Seattle, Los Angeles), or a good way to find IP ranges for a U.S. city?
My service http://ipinfo.io has an API that lets you look up the city for a given IP, e.g.:
$ curl ipinfo.io/8.8.8.8
{
"ip": "8.8.8.8",
"hostname": "google-public-dns-a.google.com",
"loc": "37.385999999999996,-122.0838",
"org": "AS15169 Google Inc.",
"city": "Mountain View",
"region": "California",
"country": "US",
"phone": 650
}
There's no API to do the reverse (find IPs for a given city), but you can do a Google search restricted to site:ipinfo.io to turn up IPs from a given city. For example, searching for Seattle, US turns up the following pages:
http://ipinfo.io/97.113.203.115
http://ipinfo.io/174.21.174.240
http://ipinfo.io/54.200.79.127
http://ipinfo.io/207.171.163.31
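Since search results can be stale, it may be worth re-checking candidate IPs before using them as test fixtures. The Python sketch below does that; the names `lookup` and `ips_in_city` are hypothetical, and the `/json` suffix on the URL is an assumption about the endpoint. The lookup function is passed in as a parameter so that the example can run offline against a stub built from the response shown above.

```python
import json
from urllib.request import urlopen


def lookup(ip):
    """Fetch the ipinfo.io record for one IP (performs a network request)."""
    with urlopen(f"https://ipinfo.io/{ip}/json") as response:
        return json.load(response)


def ips_in_city(ips, city, lookup_fn=lookup):
    """Keep only the IPs whose looked-up record reports the given city."""
    return [ip for ip in ips if lookup_fn(ip).get("city") == city]


# Offline stub taken from the 8.8.8.8 response above, so no request is made.
sample = {"8.8.8.8": {"city": "Mountain View", "region": "California", "country": "US"}}

print(ips_in_city(["8.8.8.8"], "Mountain View", lookup_fn=sample.get))  # ['8.8.8.8']
print(ips_in_city(["8.8.8.8"], "Seattle", lookup_fn=sample.get))        # []
```

Dropping the `lookup_fn` argument makes it query the live service instead.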

What are administrativeArea, subAdministrativeArea, etc. called outside of the United States?

I am working on a geographically aware high score server, and would like to be able to list scores like "First place in Dutchess County" or "Third place in the State of New York". I can reverse geocode the user's location and get a placemark that lists AdministrativeArea, etc.
The reverse geocoder used by iOS and Google would return "Dutchess" and "New York" for the above examples, so I need to supply "County" and "State" for the United States.
However, the game is global, so I need to know the names of each geographic organization level in other English-speaking countries.
So, in the United States, Google / iOS placemark levels would be described as follows:
AdministrativeArea = "State"
SubAdministrativeArea = "County" (or "Parish" in Louisiana)
Locality = "City" or "Town"
Sublocality = (I'm calling this "Neighborhood")
PostalCode = "Zip Code"
What are all of these levels called in other English-speaking countries? (Canada, UK, Australia, New Zealand, Singapore, etc). If there's a resource that lists all these, I would love to know it. I think I may just not know what to search for on the web.
I'm not entirely sure of the effort/value ratio of this exercise. It could get rather difficult, especially with unitary authority areas in England.
In most of the United Kingdom,
country is "United Kingdom" [ie the country name]
administrative_area_level_1 could be "England" or "Scotland" etc [ie the country name]
administrative_area_level_2 might be "East Sussex" [county]
locality might be "Hailsham" [town]
postal_code is a postcode
However in London,
administrative_area_level_2 is "London" which isn't a county
administrative_area_level_2 might be "Greater London" too
administrative_area_level_3 might be "London Borough of Lewisham" [yay! Borough makes sense]
locality is "London" which isn't a locality
In unitary authority areas in England,
administrative_area_level_2 might be "Medway"
Unitary authorities replace county council and borough/district councils, but they are located within a "ceremonial county" which is what most people will use ordinarily. Places in Medway Council's area are in Kent. Unfortunately these county names aren't returned by the geocoder. Some counties (eg Berkshire) were abolished completely and replaced entirely by unitary authorities. However the old county name (Kent, or Berkshire) is the right name to use.

latitude longitude from craigslist

Using YQL and an apartment search location from craigslist, I get a result in the following form. Is there any way I can get latitude/longitude information from this, or do I have to geocode the address? Are there any other sources apart from craigslist that can be used to get property details along with geolocation?
{
"about": "http://kolkata.craigslist.co.in/apa/2559284148.html",
"title": [
"Temporary Stay Rental (Kasba Area, Kolkata) 2bd 1100sqft",
"Temporary Stay Rental (Kasba Area, Kolkata) 2bd 1100sqft"
],
"link": "http://kolkata.craigslist.co.in/apa/2559284148.html",
"description": "FURNISHED apartment for executives, professionals, NRIs and visitors for temporary stay in Kolkata (Calcutta).<br>\n<br>\nLOCATION: Kasba area near Delhi Public School - close to Gariahat and EM Bypass<br>\n<br>\n2 bedrooms, 2 bathrooms, a spacious living room, a separate dining room and a modular kitchen. Cooking facility - cooking gas burner, utensils, refrigerator etc. <br>\n<br>\nApartment located at the second floor of a four story building. Elevator available.<br>\n<br>\nHot water, air conditioned bedrooms and one covered parking space available.<br>\n<br>\nRent is INR 2,000.00 (or USD 50.00) per day for a minimum stay of 15 days. Costs of electricity and cooking gas charged separately on actual usage.<br>\n<br>\nShorter stay possible at negotiable rates.<br>\n<br>\nAppropriate ID required for renting.<!-- START CLTAGS -->\n\n\n<br><br><ul class=\"blurbs\">\n<li> <!-- CLTAG GeographicArea=Kasba Area, Kolkata -->Location: Kasba Area, Kolkata\n<li>it's NOT ok to contact this poster with services or other commercial interests</ul>\n<!-- END CLTAGS -->",
"date": "2011-08-22T10:05:43+05:30",
"language": "en-us",
"rights": "Copyright © 2011 craigslist, inc.",
"source": "http://kolkata.craigslist.co.in/apa/2559284148.html",
"type": "text",
"issued": "2011-08-22T10:05:43+05:30"
}
No. You will need to map the regionIds to coordinates or a zip code.
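The only location hint inside the result itself appears to be the free-text area in the CLTAG comment of the description. A small Python sketch for pulling it out, so that it can be handed to a geocoder, follows. The name `geographic_area` is hypothetical and the regular expression is based only on the one sample above; expect an area-level match at best, not a street address.

```python
import re

# Matches e.g. <!-- CLTAG GeographicArea=Kasba Area, Kolkata -->
CLTAG_AREA = re.compile(r"<!--\s*CLTAG\s+GeographicArea=(.*?)\s*-->")


def geographic_area(description):
    """Return the GeographicArea CLTAG value from a listing description, or None."""
    match = CLTAG_AREA.search(description)
    return match.group(1) if match else None


# Tail of the "description" field from the result above.
description = (
    "Appropriate ID required for renting.<!-- START CLTAGS -->\n\n\n"
    '<br><br><ul class="blurbs">\n'
    "<li> <!-- CLTAG GeographicArea=Kasba Area, Kolkata -->Location: Kasba Area, Kolkata\n"
    "<li>it's NOT ok to contact this poster with services or other commercial interests</ul>\n"
    "<!-- END CLTAGS -->"
)

print(geographic_area(description))  # Kasba Area, Kolkata
```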
