Suppose I have a Rails table containing information chosen from a fixed set of options. For example, a field named sex could be either Male or Female. A field named Bodytype will be either slim, curvy, etc.
My question is: which is better practice, storing those values as integers or as strings?
In the first case, of course, the integers will be converted into text (in the controller?).
Thanks for all your help.
If you are not storing millions of records, storing them as strings is fine. What you lose is the ability to quickly update these categories, but if they are pretty much static and not going to change over time, it shouldn't matter. Sex should probably be called gender, and Bodytype body_type.
You should always use an index to identify attributes within your tables.
So, for example, your tables will look like this:
Gender Table

| id | sex    |
| -- | ------ |
| 1  | Female |
| 2  | Male   |

Figure Table

| id | body_type |
| -- | --------- |
| 1  | slim      |
| 2  | curvy     |
You then reference those values based on the id
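A minimal sketch of that layout in SQL (the table and column names here are illustrative, not from the original post):

-- Lookup tables hold the fixed set of options once.
create table genders (
  id  integer primary key,
  sex varchar(10) not null unique
);

create table body_types (
  id        integer primary key,
  body_type varchar(20) not null unique
);

-- The main table stores only the ids; the display text lives in one place.
create table profiles (
  id           integer primary key,
  gender_id    integer not null references genders (id),
  body_type_id integer not null references body_types (id)
);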
http://use-the-index-luke.com/
I have my Fact table with Policy data in it, and I want to add Policy Product details to the warehouse.
One policy can have different types of products, and the values are also dynamic.
E.g.: Policy01 may have two products, Building and Contents, where the sum insured values are 1000 and 500 respectively, while Policy02 has Building only, with a sum insured of 750.
There are about 30 products available, and I need to store the sum insured value and the gross and net premiums of each product per policy.
So if I add a separate column for each product type to the fact table, it'll add about 120 more columns (currently there are 23 columns). Also, there is a maximum of 5 products per policy, so only 20 columns would contain values and the others would remain empty.
Is it OK to have 100+ columns in a fact table? Is it OK to keep this many empty values in a row?
Or is there another approach that solves this?
I'm a novice at DWH and hope someone can shed some light on how to add these to my fact table.
One approach is to add a product dimension:
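A sketch of the star schema that implies, using the table and column names from the queries below (the column types, and the pre-existing Fact and Dim schemas, are assumptions):

-- Product dimension: one row per product.
create table Dim.Product (
  ProductKey   integer primary key,
  ProductName  varchar(50) not null,
  ProductGroup varchar(50) not null
);

-- Fact table: one row per policy/product combination.
create table Fact.PolicyProductValue (
  PolicyKey          integer not null,
  ProductKey         integer not null references Dim.Product (ProductKey),
  PolicyProductValue decimal(18,2) not null,
  primary key (PolicyKey, ProductKey)
);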
You can then return totals by policy:
SELECT
PolicyKey,
SUM(PolicyProductValue) AS PolicyValue
FROM
Fact.PolicyProductValue
GROUP BY
PolicyKey
;
Or by product:
SELECT
ProductKey,
SUM(PolicyProductValue) AS ProductValue
FROM
Fact.PolicyProductValue
GROUP BY
ProductKey
;
Or both:
SELECT
PolicyKey,
ProductKey,
SUM(PolicyProductValue) AS PolicyProductValue
FROM
Fact.PolicyProductValue
GROUP BY
PolicyKey,
ProductKey
;
This approach moves the products from the columns to the rows.
This technique offers several benefits:
It is easier to add new rows than columns.
You can add common filters to Dim.Product.
Dim.Product provides a location to create product hierarchies. Example:
| Product Key | Product Name | Product Group |
| ----------- | ------------ | --------------------|
| 0 | Building | Building & Contents |
| 1 | Contents | Building & Contents |
It's not OK to have 100+ columns in a fact table; it's a symptom of an incorrect data model (the same is true for missing values: a well-designed fact table shouldn't have any).
The logic of the fact table design is the following:
First, decide on the table "granularity": the most atomic level of data it will contain. In your case, the granularity is defined by Policy Number + Product. Together they uniquely identify the most detailed information available to you.
Then, identify your "facts". Typically, facts are pieces of data that you can aggregate (sum, count, average, etc). In your case, they are Insured_Value, Gross_Premium, Net_Premium.
Finally, define business context for these facts (dimensions). In your case, they are Policy and Product (most likely, you will also have some kind of Date).
Your resulting fact table should look something like this:
Policy_Date
Policy_Number
Product_ID
Insured_Value
Gross_Premium
Net_Premium
Policy_Date will provide connection to "Calendar" dimension, Product_ID will connect to "Product" dimension (table that contains your 30 products and their descriptions).
Policy_Number is what's called a "Degenerate Dimension": it's an ID that is usually not connected to any dimension (but could be if you need it to be). It's stored in the fact table just as a reference. Some people add a "Policy" dimension to the model, but usually that's a design mistake: such dimensions are too "tall", comparable in size to the fact table itself, which can dramatically slow down your model's performance. It's usually better to split policy attributes into multiple small dimensions and leave the policy number as a degenerate dimension.
So, your typical policy with 5 products will be represented as 5 records in the fact table, rather than one record with 5 sets of fields. This is the critical difference: never, ever store information (products, in your case) in the names of the fact table fields.
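A minimal sketch of that fact table in SQL (the table name and column types are assumptions; the dimension tables are not shown):

create table Policy_Product_Fact (
  Policy_Date   date not null,         -- joins to the Calendar dimension
  Policy_Number varchar(20) not null,  -- degenerate dimension, reference only
  Product_ID    integer not null,      -- joins to the Product dimension
  Insured_Value decimal(18,2) not null,
  Gross_Premium decimal(18,2) not null,
  Net_Premium   decimal(18,2) not null
);

-- A policy with 5 products becomes 5 rows here, not 1 row with 5 sets of columns.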
I have a newbie question regarding 1NF.
As I read from different sources, a table is in 1NF if it contains no repeating groups.
I understand this with the examples given online (usually with customers and contact names, etc.), but when it comes to my specific data I face difficulties.
I have the following fields:
| ID | TOW  | RECEIVER | Phi01_L1 | Phi01_L2 | Phi01_L3 |
| -- | ---- | -------- | -------- | -------- | -------- |
| 1  | 4353 | gpo1     | 0.007    | 0.006    | 0.4      |
| 2  | 4353 | gpo1     | 0.9      | 0.34     | 0.3      |
So, is this table not in 1NF? What should it look like in order to become 1NF?
What is First Normal Form (1NF)?
1NF disallows:
composite attributes
multivalued attributes
and nested relations (attributes whose values for an individual tuple are non-atomic)
How do you convert a relation into 1NF?
There are two ways to convert a relation into 1NF:
Expand the relation:
Increase the number of columns in the relation (as you did), or
Increase the number of rows and change the primary key value (the PK will include the formerly non-atomic attribute).
Hence your relation is in 1NF in its present state, and the solution you made is expansion.
Break the relation:
Break the relation into two relations, e.g. remove the non-atomic column from the base relation and create a new relation that holds it together with the PK (see the sketch below).
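A minimal sketch of that second option in SQL, using the asker's columns (the table names and types are assumptions):

-- Base relation without the repeating Phi01_* group.
create table observations (
  id       integer primary key,
  tow      integer not null,
  receiver varchar(10) not null
);

-- One row per (observation, channel) instead of one column per channel.
create table observation_phi (
  id      integer not null references observations (id),
  channel integer not null,   -- 1, 2, 3 for the former Phi01_L1..L3 columns
  phi01   decimal(10,3) not null,
  primary key (id, channel)
);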
Normal forms are best explained in the Elmasri/Navathe book.
There are many ActiveRecord versioning gems available for Rails, but most if not all of them are having trouble staying maintained. On top of that, some of them seem to have various foreign key association issues.
I'm in the process of coding a content management system where pages are stored in a tree-like hierarchy and the page fields are stored in a separate table using the EAV model.
Keeping that in mind, I'm not looking for an all-encompassing revisioning gem, because I honestly don't think I'll find one. What I am looking for is some advice on how to handle this as a custom implementation. Should I have a separate table for storing revisions and refer to a revision number in my EAV table? I foresee that this may lead to some complex validation problems, though. I currently have a problem finding a clean way to validate a regular EAV table anyway, so if anyone can comment on this it would be very much appreciated as well.
I hope this question is written well enough for SO standards. If you need any additional information, please do not hesitate to ask and I will try to help you help me. :)
I currently have a problem finding a clean way to validate a regular
EAV table anyway so if anyone can comment on this it would be very
much appreciated as well.
There isn't a clean way to either validate or constrain an EAV table. That's why DBAs call it an anti-pattern; see Bill Karwin's "SQL Antipatterns" slides (EAV starts on slide 16). Bill doesn't talk about versioning, so I will.
Versioning looks simple, but it's not. To version a row, you can add a column. It doesn't really matter much whether it's a version number or a timestamp.
create table test (
test_id integer not null,
attr_ts timestamp not null default current_timestamp,
attr_name varchar(35) not null,
attr_value varchar(35) not null,
primary key (test_id, attr_ts, attr_name)
);
insert into test (test_id, attr_name, attr_value) values
(1, 'emp_id', 1),
(1, 'emp_name', 'Alomar, Anton');
select * from test;
test_id attr_ts attr_name attr_value
--
1 2012-10-28 21:00:59.688436 emp_id 1
1 2012-10-28 21:00:59.688436 emp_name Alomar, Anton
Although it might not look like it on output, all those attribute values are varchar(35). There's no simple way for the dbms to prevent someone from entering 'wibble' as an emp_id. If you need type checking, you have to do it in application code. (And you have to keep sleep-deprived DBAs from using the command-line and GUI interfaces the dbms provides.)
With a normalized table, of course, you'd just declare emp_id to be of type integer.
With versioning, updating Anton's name becomes an insert.
insert into test (test_id, attr_name, attr_value) values
(1, 'emp_name', 'Alomar, Antonio');
With versioning, selection is mildly complicated. You can use a view instead of a common table expression.
with current_values as (
select test_id, attr_name, max(attr_ts) cur_ver_ts
from test
-- You'll probably need an index on this pair of columns to get good performance.
group by test_id, attr_name
)
select t.test_id, t.attr_name, t.attr_value
from test t
inner join current_values c
on c.test_id = t.test_id
and c.attr_name = t.attr_name
and c.cur_ver_ts = t.attr_ts;
test_id attr_name attr_value
--
1 emp_id 1
1 emp_name Alomar, Antonio
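As noted above, the latest-version logic can also live in a view instead of a CTE; a sketch, with a view name of my choosing:

create view test_current as
select t.test_id, t.attr_name, t.attr_value
from test t
inner join (
  -- Latest timestamp per (row, attribute) pair.
  select test_id, attr_name, max(attr_ts) as cur_ver_ts
  from test
  group by test_id, attr_name
) c
on c.test_id = t.test_id
and c.attr_name = t.attr_name
and c.cur_ver_ts = t.attr_ts;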
A normalized table of 1 million rows and 8 non-nullable columns has a million rows. A similar EAV table has 8 million rows. A versioned EAV table has 8 million rows, plus a row for every change to every value and every attribute name.
Storing a version number and joining to a second table that contains the current values doesn't gain much, if anything at all. Every (traditional) insert would require inserts into two tables. What would be one row of 8 columns becomes 16 rows (8 in each of two tables).
Selection is a little simpler, requiring only a join.
I have an app where shop owners can enter 10 zip codes in which they can provide services. Currently these zip codes are stored in a single table column. Now, what is the best and most efficient way to do a search based on this? Should I store all zip codes (all US zip codes) in a table and establish a many-to-many relationship, or do a text search on the current field using Thinking Sphinx?
A database guy's perspective . . .
Since you're talking about using Sphinx, I presume you store all 10 ZIP codes in a single row, like this.
shop_id zip_codes
--
167 22301, 22302, 22303, 22304, 22305, 22306, 22307, 22308, 22309, 22310
You'd be far better off storing them like this, for search and for several other reasons.
shop_id zip_codes
--
167 22301
167 22302
167 22303
167 22304
167 22305
167 22306
167 22307
167 22308
167 22309
167 22310
-- Example in SQL.
create table serviced_areas (
shop_id integer not null references shops (shop_id), -- Table "shops" not shown.
zip_code char(5) not null,
primary key (shop_id, zip_code)
);
You can make a good case for stopping after making this single change.
But you can increase data integrity substantially, without making any other changes to your database, if your dbms supports regular expressions. With that kind of dbms support, you can guarantee that the zip_code column contains exactly 5 digits, no letters. (There may be other ways to guarantee 5 digits and no letters.)
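In PostgreSQL, for example, a CHECK constraint with the ~ regular-expression operator does it (the constraint name is mine; regex syntax varies by dbms):

-- Reject anything that isn't exactly 5 digits.
alter table serviced_areas
  add constraint zip_code_format
  check (zip_code ~ '^[0-9]{5}$');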
A table of ZIP codes would further increase data integrity. But you could easily argue that the shop owners have a vested interest in entering valid ZIP codes in the first place, and that this isn't worth more effort on your part. ZIP codes change pretty often; don't expect a "complete" table of ZIP codes to be accurate for very long. And you need to have a well-defined procedure for dealing with both new and expired ZIP codes.
-- Example in SQL
create table zip_codes (
zip_code char(5) primary key
);
create table serviced_areas (
shop_id integer not null references shops (shop_id),
zip_code char(5) not null references zip_codes (zip_code),
primary key (shop_id, zip_code)
);
You will need the zip codes plus latitude/longitude in your db if you're using Sphinx to do geospatial search (not really; you could use a text file or XML, I suppose).
By geospatial search I mean something like "Find stores within 20 miles of your location"
For flexibility and efficiency, I would pick #1 ....
"store all zip codes in a table and establish many to many
relationship"
...with the assumption you also need to store other zip code data fields (City, State, County, Lat/Long, etc.). In that case your intersection table would map shop_id to zipcode_id(s). However, if you do not need/have extended ZIP Code data fields, then a single separate table mapping shop_id to the actual zip codes (not ids) will be fine in my opinion.
In SQL you can describe a binary relation with a table like
Husband | Wife
We know that a husband can have only one wife, and vice versa, so that's a 1:1 relationship, and you can specify constraints such that if you add a husband that is already in the table you get an error, right?
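A minimal sketch of that constraint in SQL (the table layout and types are mine):

create table marriages (
  husband varchar(35) not null unique,
  wife    varchar(35) not null unique
);
-- Each husband and each wife can appear at most once: inserting a husband
-- who is already in the table violates the unique constraint, giving 1:1.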
If you add a third column like this
Husband | Wife | Country
We know that in some countries one husband can have many wives; now you cannot put simple constraints on it, you have to deal with the third column.
So from a binary relation we get a ternary relation with different behavior that depends on the third column.
This example is contrived and useless; do you know any other example?
(Another example of a ternary relation such that one of the columns changes the tuple behavior?)
Thank you.
EDIT:
Another point of view on my problem:
Take any binary relation within a domain: do you know any binary relation whose constraints (or behavior) change as the domain changes?
Another example might be that you can apply coupons towards an order, but for certain coupon types you can only apply one per order whereas other coupon types may be combined.