Geocoding - Grouping multiple addresses into major cities - geolocation

I've been hunting through previous questions on geocoding, and while many are helpful, I haven't found one that fits my needs.
I need to group multiple addresses to the nearest city centers. My only address information is city, country, and state (if applicable). For example, all addresses in San Francisco and within miles should be listed as San Francisco. I'll need to know the count of addresses rolled-up to San Francisco.
I'm open to suggestions on how to approach this. I don't particularly want to manually identify a list of major cities if possible. Is there a list of these I can start from?
What about using an average lat/long location of all addresses within miles? Granted the final 'center point' would move around a bit as the average is computed but perhaps that is an approximate solution. Not quite sure how to do this so again, appreciate input!
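For illustration only, here is a minimal Python sketch of the two ideas in the question: rolling addresses up to the nearest city center within some radius, and computing an average center point. It assumes each address has already been geocoded to a (lat, lon) pair and that a list of city centers exists. The function names, the 30-mile cutoff and the coordinates are hypothetical choices, not something from the question.

```python
import math
from collections import Counter

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

def roll_up(points, centers, max_miles):
    """Count points per nearest center; points farther than max_miles are counted under None."""
    counts = Counter()
    for p in points:
        name, dist = min(((n, haversine_miles(p, c)) for n, c in centers.items()),
                         key=lambda pair: pair[1])
        counts[name if dist <= max_miles else None] += 1
    return counts

def mean_center(points):
    """Average (lat, lon) points via 3D unit vectors, which avoids wrap-around at +/-180 degrees."""
    x = y = z = 0.0
    for lat, lon in points:
        phi, lam = math.radians(lat), math.radians(lon)
        x += math.cos(phi) * math.cos(lam)
        y += math.cos(phi) * math.sin(lam)
        z += math.sin(phi)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    return (math.degrees(math.atan2(z, math.hypot(x, y))),
            math.degrees(math.atan2(y, x)))

# Hypothetical inputs: approximate city-center coordinates and geocoded addresses.
centers = {"San Francisco": (37.7749, -122.4194), "Los Angeles": (34.0522, -118.2437)}
points = [(37.8044, -122.2712), (37.6879, -122.4702), (34.1478, -118.1445)]
print(roll_up(points, centers, max_miles=30))
```

Averaging through unit vectors rather than raw latitude and longitude is a design choice here; a plain arithmetic mean would also be a reasonable approximation for points that are only a few miles apart.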

Great question. I think more generally what you want is some standard way of rolling up cities into metropolitan areas and you're exactly right that you don't want to create or maintain a list of your own.
Yahoo! GeoPlanet provides a geographic ontology with a pretty thorough hierarchy. If you were happy with standard administrative divisions (like county or state), it would be easy, but I think you're looking for something a little more general than that. GeoPlanet also provides zones, which in the US often include the town's Metropolitan Statistical Area (MSA).
If you have each city name, you could use GeoPlanet to find any MSA zones that the city belongs to and roll up to those (GeoPlanet provides a bounding box and centroid for each MSA, so you can easily place it on a map). Rural towns that aren't part of a US Census Bureau MSA may not need to be grouped to the nearest city (which may be far away anyway).
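As a rough sketch of the roll-up step itself (not of the GeoPlanet API, which is not shown here), suppose the city-to-MSA zone lookups have already been collected into a dictionary. The dictionary contents, the labels and the function name below are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical lookup from (city, state, country) to a metro-area label.
# In practice this would be filled from a gazetteer service; the labels are illustrative.
CITY_TO_MSA = {
    ("San Francisco", "CA", "US"): "San Francisco metro",
    ("Oakland", "CA", "US"): "San Francisco metro",
    ("Daly City", "CA", "US"): "San Francisco metro",
}

def roll_up_to_msa(addresses, lookup):
    """Count addresses per metro area; places with no metro entry are counted under their own city name."""
    counts = Counter()
    for city, state, country in addresses:
        counts[lookup.get((city, state, country), city)] += 1
    return counts

addresses = [
    ("Oakland", "CA", "US"),
    ("San Francisco", "CA", "US"),
    ("Daly City", "CA", "US"),
    ("Elko", "NV", "US"),  # no metro entry in the lookup, so it stays ungrouped
]
print(roll_up_to_msa(addresses, CITY_TO_MSA))
```

Leaving unmatched towns under their own name mirrors the suggestion above that rural places need not be forced into a distant city.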

Related

Creation of a platform for scientific studies and surveys

I am looking for competent people to create a platform for scientific studies and globalized surveys. I would like each human being to have an identification number (identity verification is important, otherwise the study will not be valid), and with this system each person will be able to give their opinion on different issues. The objective is therefore large in scale: the system will have to work efficiently anywhere, with many people connected at the same time.
The data war has already begun. My goal is to counterbalance that power by giving it to humanity. The platform will therefore be disruptive: many hacking attacks are to be expected, since it will be coveted for its precious data, which we must protect.
It is a project of the heart: the motivation is humanistic rather than pecuniary, even if we will earn money. Let's stop the mass manipulation, let's put the people with real practical solutions in the spotlight and let's be aware of the real percentages behind an idea.
I'm going on a world tour in a few days and I'd like to take the opportunity to promote the project around the globe, but for that I need the interface to be functional.
I hope you like the idea. I look forward to talking to you about it and finding out how and how quickly it can be done.
Sincerely, Louise

Dealing with Address Dimension and role playing it in multiple facts

A question in regards to Dimensional Modelling and Role Playing.
We have an Address dimension which is ‘role playing’. We receive Addresses from different sources including CRM systems. Addresses could also be of different types, such as Address of a company, individual etc. So from the Role Playing Address dimension, a single address could be tagged as the Address of a company and Address for billing in different facts.
There are different fact tables and they have different keys which would hold address data. Fact_Sales would have keys such as Customer_Address_Key, Company_Head_Office_Address_Key. So I believe we are kind of role playing the addresses in these facts.
Question:
Our lead Data Architect has a concern around this.
• We are capturing a lot of addresses from a number of systems. How would we identify where these addresses came from, and what type of addresses they are, without going to the fact tables?
I would still suggest going through the facts, but I would like to consult the wider community before committing to that position.
Is there a better way to do this, perhaps a separate table which defines the combination of Address_Key, Address_Type_Key and Source_Key?
Please let me know if you need any further clarification or pictures etc.
Cheers
Nithin
It sounds like, in your situation, you should just include columns for the type of address and the source of the address in the address dimension itself, so that it stands alone and you don't have to go via a fact to know what kind of thing it is. You wouldn't need a separate table with keys as you mentioned; the data can safely be denormalised in the dimension.
As an aside:
Although many people do have a separate address table, the Kimball Group approach would not be to have an 'address' or location dimension as a multi-purpose dimension that stands alone, because an address is part of what describes something else (like a company, a customer, or even a 'delivery location'). Instead you'd have the dimension (e.g. Customer), and within that dimension you'd have a number of address fields, named appropriately (CustomerAddress1, CustomerAddress2, CustomerCity). You may choose to administer the address centrally for convenience behind the scenes, with the other dimensions formed by means of views or further ETL, but in the presentation of the star schema the address table would not be seen separately. The addresses are still conformed in that they're called the same thing and mean the same thing.
However, plenty of people go with a separate Address table as you've done.
It is very reasonable to include source as an attribute of the dimension. The bigger question is how do you select the "Current" address for a customer if you have multiple sources. That is where things will get tricky.
You need Current Customer Address to mean the same thing throughout your business regardless of the source from which it was captured. I would refer to this as a conformed dimension. You need to 'conform' all of your address sources to the same structure so you can use them as a single dimension.
In the large majority of your facts, the source of the address is irrelevant. You only need to know that it is the current address. You may have a smaller model that can provide analysis on the source of the customer address.
The hard part is deciding which source is most trustworthy when the address is in multiple sources. You need to consider the source and the date of the last update. In other words, is the primary source still preferred when a less trustworthy source has a more recent update?
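One possible way to encode such a rule, purely as a sketch: rank the sources by trust, ignore records older than some staleness cutoff, and then take the most trusted (and, within a source, the most recent) of what is left. The source names, their ranking and the one-year cutoff below are all assumptions made for the example, not a recommendation.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class AddressRecord:
    customer_id: int
    source: str
    address: str
    updated: date

# Hypothetical policy: lower rank means more trusted; records older than a year count as stale.
SOURCE_RANK = {"CRM": 0, "Billing": 1, "WebForm": 2}
STALE_AFTER = timedelta(days=365)

def current_address(records, today):
    """Pick the most trusted source among non-stale records; if all are stale, consider all records."""
    fresh = [r for r in records if today - r.updated <= STALE_AFTER]
    pool = fresh or records
    # Sort key: trust rank first, then most recent update within the same rank.
    return min(pool, key=lambda r: (SOURCE_RANK[r.source], -r.updated.toordinal()))

records = [
    AddressRecord(1, "CRM", "1 Old Street", date(2020, 1, 1)),
    AddressRecord(1, "WebForm", "9 New Avenue", date(2023, 6, 1)),
]
print(current_address(records, today=date(2023, 7, 1)).address)
```

Whatever rule is chosen, applying it once in the ETL that builds the dimension keeps "current address" meaning the same thing in every fact.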
Type is usually just an attribute of the address. However, if your address can be used for multiple things (physical, shipping, billing, etc), that may need to be defined by the role-playing relationship. For other analytics on address, you can break city/state & zip into separate dimensions if you need to break things down by geographic location. I would recommend City & State be used as a single entity. If you treat City as separate from State, you'll get funny results when slicing by cities that exist in more than one state.
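A tiny illustration of the City & State point, using made-up rows: counting by city name alone merges places that share a name, while counting by the (city, state) pair keeps them apart.

```python
from collections import Counter

# Illustrative (city, state) rows; not real data.
rows = [
    ("Springfield", "IL"),
    ("Springfield", "MO"),
    ("Springfield", "IL"),
    ("Portland", "OR"),
    ("Portland", "ME"),
]

by_city = Counter(city for city, _ in rows)  # merges same-named cities across states
by_city_state = Counter(rows)                # keeps each (city, state) pair distinct

print(by_city)
print(by_city_state)
```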

Postal code database normalisation

With reference to localities and postal codes
Each postal code can have one or more localities
Each locality can have one or more postal codes
Accordingly should this be created as a M:M scenario with a 3rd join table 'areas'?
The postal code table would only have a single column being the postal code itself and the locality table would also only have a single column being the locality name.
The alternative is a single table including both but it would result in repeated data.
Thanks in advance...
The question you have asked is mostly a matter of opinion. There are many factors that might lead you to lower the level of normalization, depending on how you plan to query the data.
Traditional normalization usually suggests the M:M scenario is correct, but that leaves applications constantly joining 3 tables to relate the information, and that may not be the most efficient if the applications do this at high frequency.
The alternative of a single table with repeated data could be optimal if accompanied by well-designed non-clustered indexing, so that joins are minimized and index seeks are optimized in execution plans. However, storage would be taxed by the non-clustered indexes, and apps of course have to know that the data coming back could contain duplicates. But if the point is simply validating whether a locality is within a zip code, this is expected.
Short story, there is the textbook answer in a perfect world, and then practically there may be other factors of performance, storage, query optimization, and application tendencies that could make lower normal forms preferable for certain situations.
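To make the trade-off concrete, here is a small sketch using Python's built-in sqlite3 module (SQLite is only a stand-in; the table names, index name and sample codes are invented). It builds the three-table M:M layout from the question, with the junction table carrying the natural keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE postal_code (code TEXT PRIMARY KEY);
    CREATE TABLE locality (name TEXT PRIMARY KEY);
    -- Junction table (the 'areas' table in the question): one row per valid pairing.
    CREATE TABLE area (
        code TEXT NOT NULL REFERENCES postal_code(code),
        name TEXT NOT NULL REFERENCES locality(name),
        PRIMARY KEY (code, name)
    );
    -- Secondary index so lookups by locality are as cheap as lookups by code.
    CREATE INDEX area_by_locality ON area (name, code);
""")

# Invented sample data: code 1000 covers two localities, locality Beta has two codes.
pairs = [("1000", "Alpha"), ("1000", "Beta"), ("2000", "Beta")]
conn.executemany("INSERT OR IGNORE INTO postal_code VALUES (?)", [(c,) for c, _ in pairs])
conn.executemany("INSERT OR IGNORE INTO locality VALUES (?)", [(n,) for _, n in pairs])
conn.executemany("INSERT INTO area VALUES (?, ?)", pairs)

def localities_for(code):
    """Localities served by a postal code, read from the junction table alone."""
    rows = conn.execute("SELECT name FROM area WHERE code = ? ORDER BY name", (code,))
    return [r[0] for r in rows]

def codes_for(name):
    """Postal codes that serve a locality, read from the junction table alone."""
    rows = conn.execute("SELECT code FROM area WHERE name = ? ORDER BY code", (name,))
    return [r[0] for r in rows]

print(localities_for("1000"), codes_for("Beta"))
```

One thing this sketch suggests: when both parent tables hold nothing but their natural key, the junction table already contains every value needed, so the lookups above need no joins. In that case the junction table is effectively the "single table" alternative, and the two parent tables mainly serve to enforce valid values.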

Business location dataset

I'm looking for a good dataset of business locations, hopefully for all of the USA. I'd love to have "name," "business type," and "lat/long," although I'd settle for "street address" rather than "lat/long," and I could geocode the points myself.
Are there any free or relatively cheap data sources for business locations? Can I get this information from Google?
You might want to settle for the Yelp dataset. It seems to be the closest to what you are looking for. You'll need to be associated with an academic institution to qualify for access, though. Check out this link: http://www.yelp.com/academic_dataset

Finding geo located "Hot Spot" Areas?

Good Afternoon,
I'm currently planning a web-app/service project with a geolocation-enabled user model (lat/lng etc.) and I was wondering what the best approach would be to find the n biggest 'hot spots', e.g. areas of a given radius (e.g. 10 miles) where the most users are located?
Does anyone know a good, practical clustering algorithm or other (existing) solution(s)? This is a pretty bird's-eye-view kind of question, I know... but backend technology wise I'm still open to anything, as this particular feature is obviously only one of the whole feature set, but might help in making a decision towards a particular set of tools/languages/environments.
Cheers & thanks,
-J
SQL Server's spatial data types would be worth a look. They allow you to index on the geography column and run distance queries. I'm not sure how easy it would be to group by radius, but at least having the geography data type and building indexes on it should help a lot with this type of problem.
Geography Methods Supported by Spatial Indexes
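Independent of the database choice, a very simple way to approximate hot spots is grid binning: snap each user to a cell roughly the size of the search distance and count users per cell. The sketch below is a swapped-in technique, not a true radius search and not what SQL Server does internally; the function names, the 69 miles-per-degree approximation and the sample coordinates are assumptions.

```python
import math
from collections import Counter

MILES_PER_DEG_LAT = 69.0  # rough approximation

def _lon_step(row, lat_step):
    """Longitude width in degrees of a cell in this row, widened by 1/cos(latitude)."""
    row_lat = (row + 0.5) * lat_step
    return lat_step / max(math.cos(math.radians(row_lat)), 1e-6)

def top_hot_spots(users, cell_miles=10.0, n=3):
    """Bin (lat, lon) points into cells about cell_miles wide; return the n fullest as (center, count)."""
    lat_step = cell_miles / MILES_PER_DEG_LAT
    counts = Counter()
    for lat, lon in users:
        row = math.floor(lat / lat_step)
        col = math.floor(lon / _lon_step(row, lat_step))
        counts[(row, col)] += 1
    spots = []
    for (row, col), count in counts.most_common(n):
        center = ((row + 0.5) * lat_step, (col + 0.5) * _lon_step(row, lat_step))
        spots.append((center, count))
    return spots

# Made-up user locations: three near one city, two near another, one elsewhere.
users = [(37.77, -122.42), (37.78, -122.41), (37.775, -122.415),
         (40.71, -74.00), (40.72, -74.01), (51.50, -0.12)]
for center, count in top_hot_spots(users, cell_miles=10.0, n=2):
    print(center, count)
```

A known weakness is that a cluster straddling a cell boundary gets split between cells; a density-based clustering algorithm or a real spatial index would handle that better, at the cost of more machinery.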
