Zip code based search - ruby-on-rails

I have an app where shop owners can enter up to 10 ZIP codes in which they provide services. Currently these ZIP codes are stored in a single table column. What is the most efficient way to search on this? Should I store all US ZIP codes in a table and establish a many-to-many relationship, or do a text search on the current field using Thinking Sphinx?

A database guy's perspective...
Since you're talking about using Sphinx, I presume you store all 10 ZIP codes in a single row, like this:

shop_id  zip_codes
-------  ---------------------------------------------------------------------
167      22301, 22302, 22303, 22304, 22305, 22306, 22307, 22308, 22309, 22310
You'd be far better off storing them like this, for search and for several other reasons.
shop_id  zip_code
-------  --------
167      22301
167      22302
167      22303
167      22304
167      22305
167      22306
167      22307
167      22308
167      22309
167      22310
-- Example in SQL.
create table serviced_areas (
  shop_id integer not null references shops (shop_id), -- Table "shops" not shown.
  zip_code char(5) not null,
  primary key (shop_id, zip_code)
);
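With this layout, finding every shop that serves a given ZIP code is a single indexed lookup (the primary key covers it):

select shop_id
from serviced_areas
where zip_code = '22305';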
You can make a good case for stopping after making this single change.
But you can increase data integrity substantially, without making any other changes to your database, if your dbms supports regular expressions in CHECK constraints. With that kind of support, you can guarantee that the zip_code column contains exactly 5 digits and no letters. (There may be other ways to guarantee this.)
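For example, in PostgreSQL (regex syntax in CHECK constraints varies by dbms, so treat this as a sketch):

alter table serviced_areas
  add constraint zip_code_5_digits
  check (zip_code ~ '^[0-9]{5}$');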
A table of ZIP codes would further increase data integrity. But you could easily argue that the shop owners have a vested interest in entering valid ZIP codes in the first place, and that this isn't worth more effort on your part. ZIP codes change pretty often; don't expect a "complete" table of ZIP codes to be accurate for very long. And you need to have a well-defined procedure for dealing with both new and expired ZIP codes.
-- Example in SQL
create table zip_codes (
  zip_code char(5) primary key
);

create table serviced_areas (
  shop_id integer not null references shops (shop_id),
  zip_code char(5) not null references zip_codes (zip_code),
  primary key (shop_id, zip_code)
);

You will need the ZIP codes plus latitude/longitude in your database if you're using Sphinx to do geospatial search (strictly speaking, you could also use a text file or XML source).
By geospatial search I mean something like "Find stores within 20 miles of your location."
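Under the hood this kind of search is just a great-circle distance filter. A rough SQL sketch using the spherical law of cosines, assuming a zip_codes table with lat/lng columns in degrees (the 40.7 / -74.0 coordinates are placeholder values; 3959 is the Earth's radius in miles):

select z.zip_code
from zip_codes z
where 3959 * acos(
        cos(radians(40.7)) * cos(radians(z.lat)) * cos(radians(z.lng) - radians(-74.0))
      + sin(radians(40.7)) * sin(radians(z.lat))
      ) <= 20;  -- stores within 20 miles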

For flexibility and efficiency, I would pick option #1:
"store all zip codes in a table and establish many to many
relationship"
...with the assumption that you also need to store other ZIP code data fields (city, state, county, lat/long, etc.). In that case your intersection table would map shop_id to zipcode_id(s). However, if you do not need or have extended ZIP code data fields, then a single separate table mapping shop_id to the actual ZIP codes (not ids) will be fine in my opinion.
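A sketch of that many-to-many layout (all column names here are assumptions):

create table zip_codes (
  zipcode_id integer primary key,
  zip_code   char(5) not null unique,
  city       varchar(50),
  state      char(2),
  county     varchar(50),
  lat        double precision,
  lng        double precision
);

create table shops_zip_codes (
  shop_id    integer not null references shops (shop_id),
  zipcode_id integer not null references zip_codes (zipcode_id),
  primary key (shop_id, zipcode_id)
);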

Related

Rails how to make a table with composite key: (Category and Month/Year), and value: Amount

I need to build a budget-to-actual report, displaying net values by month per category.
Currently I'm calculating on the fly, but it's already slow with hardly more than seeded data.
I think the proper way to do this is with a composite-key table: the key is the unique combination of Category and Period (Month-Year); the value is the amount.
Each transaction would trigger a callback to update this table. At report time, a user would supply the period of the report (e.g. March 2022) and it would find all rows for that period.
I think it would look something like this:

key                           value
---                           -----
March-2021/All Categories     $335
March-2021/Food               $75
March-2021/Fuel               $60
March-2021/Entertainment      $200
...                           ...
March-2022/All Categories     $49
March-2022/Food               $25
March-2022/Fuel               -$10
March-2022/Entertainment      $34
...                           ...
April-2022/All Categories     $58
April-2022/Food               $5
April-2022/Fuel               $30
April-2022/Entertainment      $23
April-2022/Some_New_Category  $20
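For what it's worth, a minimal SQL sketch of such a table, splitting the key into its two parts and enforcing uniqueness with a composite unique index rather than a true composite primary key (all names here are assumptions):

create table budget_summaries (
  id       integer primary key,   -- ordinary Rails surrogate key
  category varchar(50) not null,  -- 'All Categories', 'Food', ...
  period   char(7) not null,      -- e.g. '2022-03' for March 2022
  amount   decimal(10,2) not null,
  unique (category, period)
);

-- Report for one period:
select category, amount
from budget_summaries
where period = '2022-03';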
I've read this SO question, but I'd rather not blindly use a gem, and anyway it seems to involve composite primary keys (User, Organization, and Department models), whereas I'm just using a Category model and a non-model period (Month-Year):
How to implement composite primary keys in rails
I can't seem to find anything about how to implement this. Am I approaching this wrong?

Identifying and relating cities from different sources

I have different providers, each of which passes me an Excel file with different cities; for each city they use a special code for their operations, plus more data useful to my business.
The problem is that I have a mess with all these cities:
I have my own cities in my database, around 9000 records.
Provider A gives me an Excel file (or a web service) with around 6000.
Provider B gives me another 5000.
Provider C ... etc
Some of the cities given by my providers are already in my database, and I only have to update the data I need.
Otherwise, I have to insert the new city in my database.
And this happens each time a provider sends me an update of its cities.
Well, the main problem is that I name a city differently from them, and they from each other... how do I know whether I already have a city or need to create a new one, when we all use different names?
The way I see it, I can only achieve this manually, comparing their cities with mine.
Of course, that's too much work, so I made my own script: by implementing the Levenshtein function in the database, I can automatically see the closest matches and select the right one with a click. The script does the rest (it stores the provider's special operation code for that city against the corresponding city in my database).
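For reference, a sketch of that ranking query using PostgreSQL's fuzzystrmatch extension (the table and column names here are assumptions, and other databases expose Levenshtein differently):

-- requires: create extension fuzzystrmatch;
select c.id, c.name,
       levenshtein(lower(c.name), lower('ProviderCityName')) as distance
from cities c
order by distance
limit 5;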
Even with it, I still feel like I'm missing something. If there were a universal code for these cities, this would be much easier and automatic, but I don't have any code identifying them other than my own table identifier. The same goes for my providers, although some of them provide a postal code along with their cities, not all do.
Is there any better solution than mine? Any universal code that you usually use, or any other approach?
Edit:
Well, each city belongs to a country. Of course, I'm considering that.
In my city table I have an id for each destination, and then a column for each provider's operation code (I know, this would be better represented with another relationship table), plus country code, zip, URL for SEO...
Regarding the solution mentioned by MagnusL, creating a synonyms table: why would I need to store the synonyms? As for the script you mentioned with Levenshtein and human interaction, that's exactly what I'm currently doing:
For each city record a provider gives me, I show the closest matches from my destinations table.
But before this, I automatically link all those that match on zip code and country.
It's a lot of work to update each provider's special operation code for each city. I'm just curious how people deal with this problem; I'm sure a lot of developers have to face it at some point.
If it is important that the cities are correctly matched, I would guess you must have some manual steps in your process. If you include names of smaller towns, you will some day find that the same name can refer to two different places in two different countries. (Try Munich on Google Maps and you get one in Germany and one in North Dakota.)
A somewhat complicated, but I guess future-proof, workflow is to use id numbers in place of city names in your main data table. Then set up a locations table with those id numbers as primary keys and your preferred name of each city, followed by as many metadata columns as required: country code, zip code, WGS84 coordinates, continent name, whatever. Add another table for city-name synonyms, with just id numbers and names (without a UNIQUE constraint on the id column).
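A minimal sketch of those two tables (names and column sizes assumed):

create table locations (
  location_id    integer primary key,
  preferred_name varchar(100) not null,
  country_code   char(2) not null,
  zip_code       varchar(10),
  lat            double precision,  -- WGS84
  lng            double precision
);

create table location_synonyms (
  location_id integer not null references locations (location_id),
  name        varchar(100) not null
  -- deliberately no UNIQUE constraint on location_id: many synonyms per city
);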
Let your import script try to match each city using as much metadata as possible (probably different metadata from different providers), together with the Levenshtein algorithm you mentioned, and make it clever enough to ask for human interaction in those cases where no city, or more than one, is matched. It can of course show you the closest possible guesses, so you can pick the right one and have it stored in the synonyms table.
(Yes, it is a lot of coding to get there. Whether it is worth it depends on how often you do these updates.)
Tip: Wikipedia has articles listing city names in different languages, e.g. https://en.wikipedia.org/wiki/List_of_names_of_European_cities_in_different_languages
What if you used an extra table for name translation?
That is, the table would have 2 columns: column A, the name you use; column B, the name a provider uses. You might need to populate this table manually, to look like:
Bruxelles:Brussels
Bruxelles:Brussel
Bruxelles:Bruxelles
While importing, you would then look up your name for the city with something like:
select A from name_translation where B = 'Brussels';
In your consolidated database, names would then be consistent.

Identifying the fact table in data warehouse design

I'm trying to design my first data mart with a star schema, starting from an Excel sheet containing information about help desk service calls. The sheet contains 33 fields of various information, and I can't identify the fact table, because I want to do the reporting later based on different KPIs.
I want to know how to identify the fact table measures easily, and I have another question: can a fact table contain only foreign keys of dimensions and no measures? Thanks in advance, and sorry for my bad English.
You can have more than one fact table.
A fact table represents an event or process that you want to analyze.
The structure of the fact tables depends on the process or event that you are trying to analyze.
You need to tell us the events or processes that you want to analyze before we can help you further.
Can a fact table contain only foreign keys of dimensions and no measures?
Yes. This is called a factless fact table.
Let's say you want to do a basic analysis of calls:
Your full table might look like this
CALL_ID
START_DATE
DURATION
AGENT_NAME
AGENT_TENURE (how long worked for company)
CUSTOMER_NAME
CUSTOMER_TENURE (how long a customer)
PRODUCT_NAME (the product the customer is calling about)
RESOLVED
You would turn this into a fact table like this:
CALL_ID
START_DATE_KEY
AGENT_KEY
CUSTOMER_KEY
PRODUCT_KEY
DURATION (measure)
RESOLVED (quasi-measure)
And you would have a DATE dimension table, AGENT dimension table, CUSTOMER dimension table and PRODUCT dimension table.
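A minimal DDL sketch of that design (all table, column, and key names here are assumptions; the dim_* tables are taken as already existing):

create table fact_call (
  call_id        integer primary key,
  start_date_key integer not null references dim_date (date_key),
  agent_key      integer not null references dim_agent (agent_key),
  customer_key   integer not null references dim_customer (customer_key),
  product_key    integer not null references dim_product (product_key),
  duration       integer not null,  -- measure, e.g. call length in seconds
  resolved       smallint not null  -- quasi-measure: 1 = resolved, 0 = not
);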
Agile Data Warehouse Design is a good book, as are the ones by Kimball.
In general, the way I've done it (and there are a number of ways to do anything) is that categorical data is referenced with a foreign key in the fact table, but anything you want to perform aggregations on (typically money, integer, or double types) can live in the fact table as well. So for example, a fact table might contain a hierarchy of types, such as product_category >> product_name, and it usually contains a time and/or location field as well, all of which would be referenced by a foreign key to a lookup table. The measure columns are usually integer-based or money data, and are used in aggregate functions grouped by the other fields, like this:
select sum(measureOne) as total, product_category
from facttable
where timeCol between X and Y
group by product_category;
At one time, a few years ago, I did have a fact table with no measure column, because the only measure I had was a count, which I would compute dynamically by grouping different dimensions in the fact table.
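That count-only pattern looks something like this (table and column names assumed):

select product_key, date_key, count(*) as event_count
from factless_fact
group by product_key, date_key;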

How to create a fact table using natural keys

We've got a data warehouse design with four dimension tables and one fact table:
dimUser        id, email, firstName, lastName
dimAddress     id, city
dimLanguage    id, language
dimDate        id, startDate, endDate
factStatistic  id, dimUserId, dimAddressId, dimLanguageId, dimDateId, loginCount, pageCalledCount
Our problem is: we want to build the fact table, which includes calculating the statistics (depending on userId and date range) and filling in the foreign keys.
But we don't know how, because we don't understand how to use natural keys (which seem to be the solution to our problem, according to the literature we've read).
I believe a natural key would be the userId, which is needed in all ETL jobs which calculate the dimension data.
But there are many difficulties:
in the ETL jobs' load() step, we do bulk inserts with INSERT IGNORE to remove duplicates => we don't know the surrogate keys that were generated
if we create metadata (including a set of dimension_name, surrogate_key, natural_key), this will not work because of the duplicate elimination
The problem seems to be the duplicate elimination strategy. Is there a better approach?
We are using MySQL 5.1, if it makes any difference.
If your fact table is tracking logins and page calls per user, then you should have a set of source tables that track these things, which is where you'll load your fact table data from. I would probably build the fact table at the grain of one row per user / login date, or even lower, to persist atomic data if at all possible.
Here you would then have a fact table with two dimensions - User and Date. You can persist address and language as dimensions on the fact as well, but these are really just attributes of user.
Your dimensions should have surrogate keys, but also should have the source "business" or "natural" key available - either as an attribute on the dimension itself, or through a mapping table as your colleague suggested. It's not "wrong" to use a mapping table - it does make things easier when there are multiple sources.
If you store the business keys in a mapping table, or in the dimension as an attribute, then for each row to load into the fact it's a simple lookup (usually via a join) against the dim or mapping table to get the surrogate key for the user (and then from the user to get the user's "current" address / language to persist on the fact). The date dimension usually has a surrogate key stored in YYYYMMDD or another "natural" format; you can generate this from the date information on the source record you're loading into the fact.
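A sketch of that lookup-join load, assuming dimUser carries the source userId as a natural-key attribute plus the user's current address and language ids (the source table and these extra columns are assumptions):

insert into factStatistic
  (dimUserId, dimAddressId, dimLanguageId, dimDateId, loginCount, pageCalledCount)
select du.id, du.currentAddressId, du.currentLanguageId, dd.id,
       s.login_count, s.page_called_count
from source_user_stats s
join dimUser du on du.userId = s.user_id                            -- natural-key lookup
join dimDate dd on s.stat_date between dd.startDate and dd.endDate;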
Don't force this into a single query; load the data in separate queries and combine it in some provider layer...

Advice on handling versioning of an EAV-model table

There are many ActiveRecord versioning gems available for Rails, but most if not all of them are having trouble being maintained. On top of that, some of them seem to have various foreign key association issues.
I'm in the process of coding a content management system where pages are stored in a tree-like hierarchy and the page fields are stored in a separate table using the EAV model.
Keeping that in mind, I'm not looking for an all-encompassing revisioning gem, because I honestly don't think I'll find one. What I am looking for is some advice on how to handle this as a custom implementation. Should I have a separate table for storing revisions and refer to a revision number in my EAV table? I foresee that this may lead to some complex validation problems, though. I currently have a problem finding a clean way to validate a regular EAV table anyway, so if anyone can comment on that it would be very much appreciated as well.
I hope this question is written well enough to SO standards. If you need any additional information, please do not hesitate to ask, and I will try to help you help me. :)
I currently have a problem finding a clean way to validate a regular
EAV table anyway so if anyone can comment on this it would be very
much appreciated as well.
There isn't a clean way to either validate or constrain an EAV table. That's why DBAs call it an anti-pattern. (EAV starts on slide 16.) Bill doesn't talk about versioning, so I will.
Versioning looks simple, but it's not. To version a row, you can add a column. It doesn't really matter much whether it's a version number or a timestamp.
create table test (
  test_id integer not null,
  attr_ts timestamp not null default current_timestamp,
  attr_name varchar(35) not null,
  attr_value varchar(35) not null,
  primary key (test_id, attr_ts, attr_name)
);
insert into test (test_id, attr_name, attr_value) values
(1, 'emp_id', 1),
(1, 'emp_name', 'Alomar, Anton');
select * from test;
test_id  attr_ts                     attr_name  attr_value
-------  --------------------------  ---------  -------------
1        2012-10-28 21:00:59.688436  emp_id     1
1        2012-10-28 21:00:59.688436  emp_name   Alomar, Anton
Although it might not look like it on output, all those attribute values are varchar(35). There's no simple way for the dbms to prevent someone from entering 'wibble' as an emp_id. If you need type checking, you have to do it in application code. (And you have to keep sleep-deprived DBAs from using the command-line and GUI interfaces the dbms provides.)
With a normalized table, of course, you'd just declare emp_id to be of type integer.
With versioning, updating Anton's name becomes an insert.
insert into test (test_id, attr_name, attr_value) values
(1, 'emp_name', 'Alomar, Antonio');
With versioning, selection is mildly complicated. You can use a view instead of a common table expression.
with current_values as (
  -- You'll probably need an index on (test_id, attr_name) for good performance.
  select test_id, attr_name, max(attr_ts) as cur_ver_ts
  from test
  group by test_id, attr_name
)
select t.test_id, t.attr_name, t.attr_value
from test t
inner join current_values c
  on c.test_id = t.test_id
  and c.attr_name = t.attr_name
  and c.cur_ver_ts = t.attr_ts;
test_id  attr_name  attr_value
-------  ---------  ----------------
1        emp_id     1
1        emp_name   Alomar, Antonio
A normalized table of 1 million rows and 8 non-nullable columns has a million rows. A similar EAV table has 8 million rows. A versioned EAV table has 8 million rows, plus a row for every change to every value and every attribute name.
Storing a version number and joining to a second table that contains the current values doesn't gain much, if anything at all. Every (traditional) insert would require inserts into two tables; what would be one row of 8 columns becomes 16 rows (8 in each of the two tables).
Selection is a little simpler, requiring only a join.
