Identifying and relating cities from different sources - analysis

I have several providers, each of which sends me an Excel file listing cities. For each city they include a special code used for their operations, plus other data useful to my business.
The problem is that I have a mess with all these cities:
I have my own cities in my database, around 9000 records.
Provider A gives me an Excel file or a web service with around 6000.
Provider B gives me another 5000.
Provider C ... etc
Some of the cities given by my providers are already in my database and I only have to update the required data I need.
Otherwise, I have to insert that new city in my database.
And I have to do this each time a provider sends me an update of these cities.
The main problem is that I name cities differently from my providers, and they name them differently from each other. Since we all use different names, how can I know whether I already have a city or need to create a new one?
The way I see it, I can only achieve this manually, by comparing their cities with mine.
Of course, that's too much work, so I wrote my own script. Using a Levenshtein function in the database, I can automatically see the closest matches and select one with a click. The script does the rest (it updates the provider's special operation code for that city on the corresponding city stored in my database).
Even so, I still feel like I'm missing something. If there were a universal code for those cities this would be much easier and could be automated, but I have no code that identifies these cities other than my own table identifier. The same goes for my providers, although some of them (not all) include the postal code with the cities they provide.
Is there any better solution than mine for this? Is there a universal code that you usually use, or any other approach?
Edit:
Well, each city belongs to a country. Of course, I'm considering that.
In my city table I have an Id for each destination, and then a column for the operation code of each provider (I know, this would be better represented with one more relationship), plus country code, zip, URL for SEO...
Regarding the solution mentioned by MagnusL, creating a Synonyms table: why would I need to store the synonyms? As for the script you mentioned with Levenshtein and human interaction, that's exactly what I'm currently doing:
For each city record from a provider, I show the closest matches from my destinations table.
But before this, I automatically link all those that match on zip code and country.
It's a lot of work to update my providers' special operation codes for each city. I am just curious about how people deal with this problem; I'm sure a lot of developers have to face this at some point.

If it is important that the cities are correctly matched, I would guess you must have some manual steps in your process. If you include names of smaller towns you will some day encounter that the same name could actually be two different places in two different countries. (Try Munich on Google Maps and you get one in Germany and one in North Dakota.)
A somewhat complicated, but I guess future proof, workflow is to use id numbers in place of city names in your main data table. Then set up a locations table with those id numbers as primary keys and your preferred name of the city followed by as many meta data columns as required for country code, zip code, WGS84 coordinates, continent name, whatever. Add another table for city name synonyms, with just id numbers and names (without UNIQUE constraint on the id column).
Let your import script try to match the city using as much metadata as possible (probably different metadata from different providers), together with the Levenshtein algorithm you mentioned, and let it be clever enough to ask for human interaction in those cases where no city or more than one city is matched. It can of course show you the closest possible guesses, so you can pick the right one and have it stored in the synonym table.
(Yes, it is a lot of coding to get there. If you find it worth it or not depends on how often you do these updates.)
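The workflow described above could be sketched roughly as follows. This is a minimal illustration, not a finished import script: the `LOCATIONS` and `SYNONYMS` dictionaries are hypothetical in-memory stand-ins for the locations and synonyms tables, and the `max_distance` threshold is an assumed tuning value.

```python
from typing import Dict, List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical stand-ins for the two tables: id -> (preferred name, country code)
LOCATIONS: Dict[int, Tuple[str, str]] = {
    1: ("Munich", "DE"),
    2: ("Munich", "US"),
    3: ("Brussels", "BE"),
}
# (lower-cased provider name, country code) -> location id
SYNONYMS: Dict[Tuple[str, str], int] = {("münchen", "DE"): 1}


def match_city(name: str, country: str,
               max_distance: int = 2) -> Tuple[Optional[int], List[int]]:
    """Return (matched_id, candidates); matched_id is None if a human must pick."""
    key = (name.strip().lower(), country)
    if key in SYNONYMS:                      # a previously confirmed spelling
        return SYNONYMS[key], []
    scored = sorted(
        (levenshtein(key[0], loc_name.lower()), loc_id)
        for loc_id, (loc_name, loc_country) in LOCATIONS.items()
        if loc_country == country            # metadata narrows the search first
    )
    close = [loc_id for dist, loc_id in scored if dist <= max_distance]
    if len(close) == 1:                      # exactly one close match: auto-link
        return close[0], []
    return None, [loc_id for _, loc_id in scored]   # zero or several: ask


def confirm_match(name: str, country: str, loc_id: int) -> None:
    """Store the human's choice so the same spelling matches automatically next time."""
    SYNONYMS[(name.strip().lower(), country)] = loc_id
```

Here `match_city("Brusels", "BE")` links automatically, while `match_city("Bruxelles", "BE")` is too far from "Brussels" for the assumed threshold and comes back as a candidate list for a human. After `confirm_match("Bruxelles", "BE", 3)` the same spelling resolves without asking, which is what the synonym table is for.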
Tip: Wikipedia has articles listing the different names of cities, e.g. https://en.wikipedia.org/wiki/List_of_names_of_European_cities_in_different_languages

What if you used an extra table for name translation?
I.e., the table would have 2 columns: column A, the name you use, and column B, the name a provider uses. You might need to maintain this table manually, so that it looks like:
Bruxelles:Brussels
Bruxelles:Brussel
Bruxelles:Bruxelles
While importing, for the name of the city you would then use
select A from city_translation where B = 'Brussels'
(where city_translation is whatever you call the translation table)
In your agglomerated database, names would then be consistent.
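As a rough sketch of that lookup, here is the translation table built in an in-memory SQLite database. The table name `city_translation` and the column names `own_name` (column A) and `provider_name` (column B) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_translation ("
    " own_name TEXT NOT NULL,"              # column A: the name you use
    " provider_name TEXT NOT NULL UNIQUE)"  # column B: a name a provider uses
)
conn.executemany(
    "INSERT INTO city_translation (own_name, provider_name) VALUES (?, ?)",
    [("Bruxelles", "Brussels"), ("Bruxelles", "Brussel"), ("Bruxelles", "Bruxelles")],
)


def canonical_name(provider_name: str):
    """Translate a provider's city name to your own; None means 'not known yet'."""
    row = conn.execute(
        "SELECT own_name FROM city_translation WHERE provider_name = ?",
        (provider_name,),
    ).fetchone()
    return row[0] if row else None
```

A `None` result is the signal that the name has to be added to the table by hand before the import can continue.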

Related

Best Practice: Managing string list with entity framework and migration

I've been searching for the best practice for managing/storing a list of strings in the database, optionally with Entity Framework, but migrations should be supported.
E.g.
I have a list of city names which I would like to store in a new table. This table contains all available cities in my project.
I have a property City in an Address class, which refers to one of the cities in my City table.
1. Is it better to set a reference to the entry in the City-Table or apply and store the value in the Address - Table ?
2. What's the best practice for creating the City-Table? Generating a class City in my Model seems to be a little bit too much overhead, but where can I manage/create it if I like to reference the entries in the Address?
It all comes down to what your application needs to do. If you are planning to do some operations with cities, it would be wise to have them as separate entities. Otherwise, you would have to query the Address table and group by city (which causes trouble when, for example, people enter both "Moron" and "Morón" for the same city).
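To illustrate the grouping trouble with free-text city names, here is a small sketch of the kind of normalization you would end up needing. The function name is made up, and folding accents is an assumption that can also merge places that are genuinely different, which is one more argument for a separate City entity.

```python
import unicodedata


def normalize_city(name: str) -> str:
    """Fold case, accents and extra whitespace so near-duplicate spellings compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.casefold().split())
```

With this, "Morón" and "  moron " both normalize to the same key, but every query that groups addresses by city would have to remember to apply it.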

How to model country, state and cities using Neo4j

I'm building a registration form for my website (it uses Neo4j) and need to populate the country, state and city fields. All these fields are interlinked, i.e. the available states depend on the selected country and the available cities depend on the selected state. I'm trying to figure out the best approach to model this using Neo4j. Do I need to create nodes for each country, state and city, and then create relationships between all of them? For instance, Detroit - belongs to - Michigan - belongs to - United States. What would be the best approach to handle this in Neo4j? Are there any examples to look at? Would it be efficient to do this in Neo4j? Or is it better to use a document-based DB such as MongoDB for that?
I don't see any reason you can't do what you suggested, creating nodes for City, State, Country and wiring them up (I'm planning on doing this exact same thing with my upcoming project). This also lets you reuse those nodes in other parts of your graph, potentially allowing you to make interesting queries using common locations at faster speeds than property comparisons.
If I understand your requirements correctly, you'll have dropdowns or autocomplete fields or similar to drill down to each level (populate dropdown with countries -> populate next dropdown with states in the selected country -> populate last drop down with cities in the selected state). Just add indexes on identifier or abbreviation for quick node lookup and you're good, it should work quite fast.
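The shape of that graph and the drill-down lookups can be sketched in plain Python. This is not Neo4j code: the `LocationGraph` class is a hypothetical stand-in in which each `(label, name)` tuple plays the role of a node and `belongs_to` plays the role of a BELONGS_TO relationship.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Node = Tuple[str, str]  # (label, name), e.g. ("City", "Detroit")


class LocationGraph:
    """Toy stand-in for City/State/Country nodes wired with BELONGS_TO edges."""

    def __init__(self) -> None:
        self._children: Dict[Node, List[Node]] = defaultdict(list)
        self._parent: Dict[Node, Node] = {}

    def belongs_to(self, child: Node, parent: Node) -> None:
        self._parent[child] = parent
        self._children[parent].append(child)

    def children(self, parent: Node, label: str) -> List[str]:
        """Names to show in the next dropdown once `parent` has been selected."""
        return sorted(name for lbl, name in self._children.get(parent, []) if lbl == label)

    def path(self, node: Node) -> List[str]:
        """Follow BELONGS_TO upwards, e.g. city -> state -> country."""
        chain = [node[1]]
        while node in self._parent:
            node = self._parent[node]
            chain.append(node[1])
        return chain


graph = LocationGraph()
usa = ("Country", "United States")
graph.belongs_to(("State", "Michigan"), usa)
graph.belongs_to(("State", "Ohio"), usa)
graph.belongs_to(("City", "Detroit"), ("State", "Michigan"))
graph.belongs_to(("City", "Ann Arbor"), ("State", "Michigan"))
```

Selecting the country gives `graph.children(usa, "State")` for the state dropdown, and selecting Michigan gives its cities for the last one. In a real graph database the dictionary lookups would be indexed node lookups followed by a one-hop traversal.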
If you're adding zip codes in there, that could be tricky, as you can't really model it in the same way. You'll have one-to-many relationships from both state and city to zip, and unless I'm mistaken there are a few interesting zips which can span more than a single state and/or city. Some other factors that can complicate things include 5 vs 9 digit zips (or more for other countries), and handling of zip-equivalents in other countries, as they may adhere to different logic.

Custom fields in Rails that act as a template for future entries

I'm looking for some feedback on my current plan of implementing custom fields in rails. I'm new to rails and app development in general and would appreciate any comments from more experienced individuals.
Background
The app: Keep track of food and beverage tastings.
What I'm trying to model:
User creates a new sample type.
They call it: "Wine"
They decide that, for their company, they'd like to keep track of the following attributes: Origin, Grape Type, Company, Elevation, Temperature Kept, and more.
The only assumptions about a sample type that my database has made is that it has a Name. (eg. coffee, wine, etc.) the rest are all custom fields specified by the user.
Now that a sample type has been created.
The user begins to create samples of sample type wine.
They choose create sample, choose of type Wine.
The fields they must fill in are the ones they specified earlier.
In Origin they put: France, in Grape type: they put chardonnay, etc..
--
My plan of approach is as follows:
When a user creates the sample type, store the custom fields as an array or in some string format and keep it under a column called data.
SampleType
name: wine
data: [origin, grape_type, company, ...]
When a user wants to create a sample of type Wine:
I look up the sample type wine, for each key in the data column, it creates form fields.
When the user submits the data, I create a hash of all the custom fields names and their corresponding data. I serialize it and store it in a hash in a data column like such:
Sample
type: wine
data: { origin: "France", grape_type: "Pinot Grigio", ... }
My plan at the moment is to use PostgreSQL's hstore to implement the hashing in the data column.
My questions are:
Is this a valid solution for what I'm trying to do?
Will I run into trouble when users change what custom fields they want?
Any other concerns I should take into account?
Are MongoDB and similar databases a better choice for this type of model?
I've been using the following links as a reference:
http://schneems.com/post/19298469372/you-got-nosql-in-my-postgres-using-hstore-in-rails
http://blog.artlogic.com/2012/09/13/custom-fields-in-rails/
I've also read many other Stack Overflow posts, but none seem to use it in the way I describe above.
Any comments are appreciated.
jtgi, having done something like this more times than I want to remember, my first response was, "run away!" In my experience, the whole user-defined field thing is an ugly, hacky, nightmare. Soon, someone will ask, "can I search on grape?" or "I want to be able to input multiple values for grape." And on and on, and you will hate yourself for ever stepping down this path. :-)
That said, I think your approach is pretty decent. To answer your questions directly:
1. Yes, this is a valid approach.
2. Yes, you will run into trouble when users change the custom fields they want. (see above)
3. See some notes below.
4. Might be. I went there even before I read your 4th question. With your field => value hash, you're kind of implementing a NoSQL solution anyhow, but it'll be non-trivial to implement lookups, searches, etc.
Some thoughts:
I think I would marshal the data into a db column, rather than using a db function. That way, it's pure Ruby and not dependent on the db type. See http://www.ruby-doc.org/core-1.9.3/Marshal.html. I'm doing this to cache some data in an app right now, and it's pretty slick. You may need to marshal(l) the data anyhow, if you want to wind up storing Ruby objects more complex than strings.
You'll probably get there soon anyhow, so I would plan on storing some "metadata" about the attributes while you're at it. E.g., "grape" is a String, max length 20, "rating" is an integer between 0 and 100. That way you can make your form a little prettier and do some rudimentary validation.
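The attribute "metadata" idea could look something like the sketch below. It is written in Python for illustration rather than Ruby, and the `FieldSpec` class, its attribute names and the error messages are all made up for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class FieldSpec:
    """Metadata stored alongside a custom field name (hypothetical shape)."""
    name: str
    kind: type
    max_length: Optional[int] = None
    minimum: Optional[int] = None
    maximum: Optional[int] = None


def validate(specs: List[FieldSpec], data: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the sample is valid."""
    errors: List[str] = []
    for spec in specs:
        if spec.name not in data:
            errors.append(f"{spec.name}: missing")
            continue
        value = data[spec.name]
        if not isinstance(value, spec.kind):
            errors.append(f"{spec.name}: expected {spec.kind.__name__}")
            continue
        if spec.max_length is not None and len(value) > spec.max_length:
            errors.append(f"{spec.name}: longer than {spec.max_length} characters")
        if spec.minimum is not None and value < spec.minimum:
            errors.append(f"{spec.name}: below {spec.minimum}")
        if spec.maximum is not None and value > spec.maximum:
            errors.append(f"{spec.name}: above {spec.maximum}")
    return errors


# "grape" is a String, max length 20; "rating" is an integer between 0 and 100.
WINE_FIELDS = [
    FieldSpec("grape", str, max_length=20),
    FieldSpec("rating", int, minimum=0, maximum=100),
]
```

The same specs can drive form rendering (text input versus number input), which is where the "prettier form" comes from.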
When you come to hate this feature, you can remember me. :-)

Database table design

I have a table of property listings. I need to add cities to these listings. Is it best practice to split a list of cities into its own table?
I would like the user when adding a new property to be able to select from a list of cities.
By the way this is a Rails project.
A cities lookup table makes sense in this case.
This will also allow you to add more information for each city in the future, if needed.
If there is only one city per property, there is nothing terribly wrong with putting it in the properties table. If there are more, there is no good choice but to use a cities table.
Alternatively, if you want to pick the cities from a drop-down list with no additions allowed, having a cities table is a good idea. If you do that, you probably want to store the city_id, not the city name, in the properties table. That way, when someone changes the name of a city (which admittedly doesn't happen very often), you only have to change one record. Of course, if you do have a cities table, you should add a foreign key to maintain data integrity, and make sure city_id is indexed in the properties table.
Yes. It is generally a best practice to normalize your database schema such that you are not repeating the same city names in multiple property listing records in your property listing table.
There are cases where you would want to denormalize for performance reasons. I would not consider your case to be one of these cases until it proves itself to be (i.e. table reads become very slow.) Even then, there are optimizations you could undertake prior to denormalizing your schema.

Single Inheritance or Polymorphic?

I'm programming a website that allows users to post classified ads with detailed fields for different types of items they are selling. However, I have a question about the best database schema.
The site features many categories (eg. Cars, Computers, Cameras) and each category of ads have their own distinct fields. For example, Cars have attributes such as number of doors, make, model, and horsepower while Computers have attributes such as CPU, RAM, Motherboard Model, etc.
Now since they are all listings, I was thinking of a polymorphic approach, creating a parent LISTINGS table and a different child table for each of the different categories (COMPUTERS, CARS, CAMERAS). Each child table will have a listing_id that will link back to the LISTINGS TABLE. So when a listing is fetched, it would fetch a row from LISTINGS joined by the linked row in the associated child table.
LISTINGS
-listing_id
-user_id
-email_address
-date_created
-description
CARS
-car_id
-listing_id
-make
-model
-num_doors
-horsepower
COMPUTERS
-computer_id
-listing_id
-cpu
-ram
-motherboard_model
Now, is this schema a good design pattern or are there better ways to do this?
I considered single inheritance but quickly brushed off the thought because the table will get too large too quickly, but then another dilemma came to mind - if the user does a global search on all the listings, then that means I will have to query each child table separately. What happens if I have over 100 different categories, wouldn't it be inefficient?
I also thought of another approach where there is a master table (meta table) that defines the fields in each category and a field table that stores the field values of each listing, but would that go against database normalization?
How would sites like Kijiji do it?
Your database design is fine. No reason to change what you've got. I've seen the search done a few ways. One is to have your search stored procedure join all the tables you need to search across and index the columns to be searched. The second way I've seen it done which worked pretty well was to have a table that is only used for search which gets a copy of whatever fields that need to be searched. Then you would put triggers on those fields and update the search table.
They both have drawbacks but I preferred the first to the second.
EDIT
You need the following tables.
Categories
- Id
- Description
CategoriesListingsXref
- CategoryId
- ListingId
With this cross reference model you can join all your listings for a given category during search. Then add a little dynamic sql (because it's easier to understand) and build up your query to include the field(s) you want to search against and call execute on your query.
That's it.
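The cross-reference join plus a little dynamic SQL might look like this sketch, written with SQLite from Python instead of a stored procedure. The `SEARCHABLE` whitelist and the sample rows are assumptions added for the example; the whitelist is there because a column name cannot be passed as a bound parameter.

```python
import sqlite3

# Only these LISTINGS columns may be interpolated into the dynamic query.
SEARCHABLE = {"description", "email_address"}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Listings (listing_id INTEGER PRIMARY KEY,"
    " email_address TEXT, description TEXT)"
)
conn.execute("CREATE TABLE Categories (Id INTEGER PRIMARY KEY, Description TEXT)")
conn.execute(
    "CREATE TABLE CategoriesListingsXref ("
    " CategoryId INTEGER REFERENCES Categories(Id),"
    " ListingId INTEGER REFERENCES Listings(listing_id),"
    " PRIMARY KEY (CategoryId, ListingId))"
)
conn.executemany(
    "INSERT INTO Listings VALUES (?, ?, ?)",
    [(1, "a@example.com", "Red sedan, low mileage"),
     (2, "b@example.com", "Gaming laptop 16GB RAM"),
     (3, "c@example.com", "Blue sedan")],
)
conn.executemany("INSERT INTO Categories VALUES (?, ?)", [(1, "Cars"), (2, "Computers")])
conn.executemany("INSERT INTO CategoriesListingsXref VALUES (?, ?)", [(1, 1), (2, 2), (1, 3)])


def search(category: str, field: str, term: str):
    """Build the query for the chosen field and return matching listing ids."""
    if field not in SEARCHABLE:
        raise ValueError(f"cannot search on {field!r}")
    sql = (
        "SELECT l.listing_id FROM Listings l"
        " JOIN CategoriesListingsXref x ON x.ListingId = l.listing_id"
        " JOIN Categories c ON c.Id = x.CategoryId"
        f" WHERE c.Description = ? AND l.{field} LIKE ?"
        " ORDER BY l.listing_id"
    )
    return [row[0] for row in conn.execute(sql, (category, f"%{term}%"))]
```

For example, `search("Cars", "description", "sedan")` returns listings 1 and 3. Only the column name is spliced into the SQL text; the category and search term stay as bound parameters.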
EDIT 2
This seems to be a bigger discussion than we can fit in these comment boxes. But anything we would discuss can be understood by reading the following post.
http://www.sommarskog.se/dyn-search-2008.html
It is really complete and shows you more than one way of doing it, with pros and cons.
Good luck.
I think the design you have chosen will be good for the scenario you just described. Though I'm not sure if the sub class tables should have their own ID. Since a CAR is a Listing, it makes sense that the values are from the same "domain".
In the typical classified ads site, the data for an ad is written once and then is basically read-only. You can exploit this and store the data in a second set of tables that are more optimized for searching in just the way you want the users to search. Also, the search problem only really exists for a "general" search. Once the user picks a certain type of ad, you can switch to the sub class tables in order to do more advanced search (RAM > 4gb, cpu = overpowered).

Resources